Architectures for natural language processing

ABSTRACT

Systems are presented for generating a natural language model. The system may comprise a database module, an application program interface (API) module, a background processing module, and an applications module, each stored on the at least one memory and executable by the at least one processor. The system may be configured to generate the natural language model by: ingesting training data, generating a hierarchical data structure, selecting a plurality of documents among the training data to be annotated, generating an annotation prompt for each document configured to elicit an annotation about said document, receiving the annotation based on the annotation prompt, and generating the natural language model using an adaptive machine learning process configured to determine patterns among the annotations for how the documents in the training data are to be subdivided according to the at least two topical nodes of the hierarchical data structure.

CROSS REFERENCES TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 16/289,481, filed Feb. 28, 2019, and titled “ARCHITECTURES FOR NATURAL LANGUAGE PROCESSING,” which is a continuation of U.S. patent application Ser. No. 14/964,518, filed Dec. 9, 2015, and titled “ARCHITECTURES FOR NATURAL LANGUAGE PROCESSING,” which claims the benefits of U.S. Provisional Application 62/089,736, filed Dec. 9, 2014, and titled, “METHODS AND SYSTEMS FOR ANNOTATING NATURAL LANGUAGE PROCESSING,” U.S. Provisional Application 62/089,742, filed Dec. 9, 2014, and titled, “METHODS AND SYSTEMS FOR IMPROVING MACHINE PERFORMANCE IN NATURAL LANGUAGE PROCESSING,” U.S. Provisional Application 62/089,745, filed Dec. 9, 2014, and titled, “METHODS AND SYSTEMS FOR IMPROVING FUNCTIONALITY IN NATURAL LANGUAGE PROCESSING,” and U.S. Provisional Application 62/089,747, filed Dec. 9, 2014, and titled, “METHODS AND SYSTEMS FOR SUPPORTING NATURAL LANGUAGE PROCESSING,” the disclosures of which are incorporated herein in their entireties and for all purposes.

This application is also related to US non provisional applications (Attorney Docket No. 1402805.00006_IDB006), titled “METHODS FOR GENERATING NATURAL LANGUAGE PROCESSING SYSTEMS,” (Attorney Docket No. 1402805.00012_IDB012), titled “OPTIMIZATION TECHNIQUES FOR ARTIFICIAL INTELLIGENCE,” (Attorney Docket No. 1402805.00013_IDB013), titled “GRAPHICAL SYSTEMS AND METHODS FOR HUMAN-IN-THE-LOOP MACHINE INTELLIGENCE,” (Attorney Docket No. 1402805.00014_IDB014), titled “METHODS AND SYSTEMS FOR IMPROVING MACHINE LEARNING PERFORMANCE,” (Attorney Docket No. 1402805.000015_IDB015), titled “METHODS AND SYSTEMS FOR MODELING COMPLEX TAXONOMIES WITH NATURAL LANGUAGE UNDERSTANDING,” (Attorney Docket No. 1402805.00016_IDB016), titled “AN INTELLIGENT SYSTEM THAT DYNAMICALLY IMPROVES ITS KNOWLEDGE AND CODE-BASE FOR NATURAL LANGUAGE UNDERSTANDING,” (Attorney Docket No. 1402805.00017_IDB017), titled “METHODS AND SYSTEMS FOR LANGUAGE-AGNOSTIC MACHINE LEARNING IN NATURAL LANGUAGE PROCESSING USING FEATURE EXTRACTION,” (Attorney Docket No. 1402805.00018_IDB018), titled “METHODS AND SYSTEMS FOR PROVIDING UNIVERSAL PORTABILITY IN MACHINE LEARNING,” and (Attorney Docket No. 1402805.00019_IDB019), titled “TECHNIQUES FOR COMBINING HUMAN AND MACHINE LEARNING IN NATURAL LANGUAGE PROCESSING,” each of which were filed concurrently with U.S. patent application Ser. No. 14/964,518, and the entire contents and substance of all of which are hereby incorporated in total by reference in their entireties and for all purposes.

TECHNICAL FIELD

The subject matter disclosed herein generally relates to processing data. In some example embodiments, the present disclosures relate to systems for generating natural language models.

BACKGROUND

It has long been a goal to program machines to process human readable language, sometimes in part as an effort to generate artificial intelligence. However, programming computers to process human readable language has proven to be far more difficult than imagined, particularly as languages continue to change and evolve, and the meanings of words and phrases are more ambiguous and nuanced than assumed. A number of techniques are available for processing natural language by computers, but the methods for generating these models either are inaccurate and imprecise or require months of refinement and programming to accurately model specific subject areas of language. It is desirable therefore to develop improved methods for generating natural language models that are accurate and quick while also reducing human time spent generating the models.

BRIEF SUMMARY

In some embodiments, a system for generating a natural language model is presented. The system may include: at least one memory and at least one processor communicatively coupled to the at least one memory; and a database module, an application program interface (API) module, a background processing module, and an applications module, each stored on the at least one memory and executable by the at least one processor; the API module configured to ingest training data representative of documents to be analyzed by the natural language model and to store the training data in the database module; the background processing module configured to: generate a hierarchical data structure, the hierarchical data structure comprising at least two topical nodes, wherein the at least two topical nodes represent partitions organized by two or more topical themes among the topical content of the training data within which the training data is to be subdivided into; select among the training data a plurality of documents to be annotated; generate at least one annotation prompt for each document among the plurality of documents to be annotated, said annotation prompt configured to elicit an annotation about said document indicating which node among the at least two topical nodes of the hierarchical data structure said document is to be classified into; the applications module configured to cause display of the at least one annotation prompt for each document among the plurality of documents to be annotated; the API module further configured to receive for each document among the plurality of documents to be annotated, the annotation in response to the displayed annotation prompt; and the background processing module further configured to generate the natural language model using an adaptive machine learning process configured to determine, among the received annotations, patterns for how the documents in the training data are to be subdivided according to the at least two topical nodes of the hierarchical data structure.

In some embodiments, the background processing module is further configured to test performance of the natural language model using a subset of the documents among the training data that received annotations. In some embodiments, the background processing module is further configured to: compute a performance metric of the natural language model, based on results of the testing; and determine whether the natural language model satisfies at least one performance criterion based on the computed performance metric. In some embodiments, the background processing module is further configured to: perform one or more optimization techniques configured to improve performance of the natural language platform, in response to determining that the natural language platform fails to satisfy the at least one performance criterion based on the computed performance metric. In some embodiments, the one or more optimization techniques comprise at least one of: a feature selection process, a padding and rebalancing process of the natural language model, a pruning process of the natural language model, a feature discovery process, a smoothing process of the natural language model, or a model interpolation process.
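The test-and-check step described above can be illustrated with a minimal sketch. It assumes plain accuracy as the performance metric and a fixed threshold as the performance criterion; the function names and the 0.8 threshold are hypothetical and are not taken from the disclosure, which does not limit the metric or the criterion to these choices.

```python
def accuracy(predicted, annotated):
    """Fraction of held-out annotated documents the model labeled correctly."""
    if len(predicted) != len(annotated) or not annotated:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(p == a for p, a in zip(predicted, annotated))
    return correct / len(annotated)


def satisfies_criterion(metric, threshold=0.8):
    """Hypothetical performance criterion: the metric must reach a threshold."""
    return metric >= threshold


# Labels predicted by the model versus labels supplied by annotators.
predicted = ["Internet", "phone", "Internet", "phone"]
annotated = ["Internet", "phone", "phone", "phone"]

metric = accuracy(predicted, annotated)
needs_optimization = not satisfies_criterion(metric)
```

In this sketch, a result of `needs_optimization` being true would be the trigger for applying one or more of the optimization techniques listed above.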

In some embodiments, the background processing module is further configured to: determine that the natural language platform fails to satisfy the at least one performance criterion based on the computed performance metric; in response to said determining: identify a topical node among the two or more topical nodes of the hierarchical data structure that the natural language model fails to accurately categorize documents into; select a second plurality of documents to be annotated, the second plurality comprising documents associated with said topical node that the natural language model failed to accurately categorize documents into; and generate a second set of at least one annotation prompt for each document among the second plurality of documents to be annotated, said annotation prompt among the second set configured to elicit an annotation about said document to improve the natural language model in accurately categorizing documents into said topical node; wherein the applications module is further configured to cause display of the second set of the at least one annotation prompt for each document among the second plurality of documents to be annotated; wherein the API module is further configured to receive for each document among the second plurality of documents to be annotated, a second set of annotations in response to the second set of displayed annotation prompts; and wherein the background processing module is further configured to generate a refined natural language model using the adaptive machine learning process and based on the hierarchical data structure, the training data and the second set of annotations.
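One way to picture the refinement step above is the following sketch. It assumes that a poorly performing topical node is identified by per-node accuracy on annotated test documents, and that documents "associated with" the node are those the current model places into it. Both are illustrative assumptions, and all names are hypothetical.

```python
from collections import defaultdict


def per_node_accuracy(predicted, annotated):
    """Accuracy computed separately for each annotated topical node."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for p, a in zip(predicted, annotated):
        totals[a] += 1
        if p == a:
            correct[a] += 1
    return {node: correct[node] / totals[node] for node in totals}


def weakest_node(scores):
    """Node with the lowest per-node accuracy."""
    return min(scores, key=scores.get)


def select_for_reannotation(candidates, node, limit):
    """Pick up to `limit` document ids the model currently places in `node`.

    candidates: list of (doc_id, predicted_node) pairs.
    """
    return [doc_id for doc_id, n in candidates if n == node][:limit]


annotated = ["Internet", "Internet", "phone", "phone", "phone"]
predicted = ["Internet", "Internet", "phone", "Internet", "Internet"]

scores = per_node_accuracy(predicted, annotated)
target = weakest_node(scores)
second_batch = select_for_reannotation(
    [("d1", "phone"), ("d2", "Internet"), ("d3", "phone"), ("d4", "phone")],
    target,
    limit=2,
)
```

The documents in `second_batch` would then receive the second set of annotation prompts, and the resulting annotations would feed the next training round.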

In some embodiments, generating the hierarchical data structure comprises: performing a topic modeling process configured to identify two or more topics among the content of the training data that is configured to define the two or more topical nodes of the hierarchical data structure.

In some embodiments, the background processing module is further configured to access one or more rules configured to instruct the natural language model how to categorize one or more documents into the two or more nodes of the hierarchical data structure. In some embodiments, generating the hierarchical data structure comprises: conducting a rules generation process configured to evaluate logical consistency among the one or more rules.

In some embodiments, generating the hierarchical data structure comprises: generating at least one annotation prompt for each topical node among the two or more topical nodes in the hierarchical data structure, said annotation prompt configured to elicit an annotation about said topical node indicating a level of accuracy of placement of the node within the hierarchical data structure; causing display of the at least one annotation prompt for each topical node; and receiving for each topical node, the annotation in response to the displayed annotation prompt. In some embodiments, the background processing module is further configured to evaluate performance of the hierarchical data structure based on the annotations. In some embodiments, the background processing module is further configured to determine that the hierarchical data structure fails to satisfy at least one performance criterion in response to the evaluating; and modify a logical relationship among the two or more topical nodes based on the annotations and in response to determining that the data structure fails to satisfy the at least one performance criterion.

In some embodiments, the API module is further configured to receive a training guideline based on the annotations to the nodes, the training guideline configured to provide instructions to an annotator for answering one or more annotation prompts for each document among the plurality of documents to be annotated.

In some embodiments, the hierarchical data structure comprises at least a third topical node and a fourth topical node, wherein the third and fourth topical nodes both represent sub-partitions within the topical theme of the first node and organized by a third and fourth topical theme, respectively, among the topical content of the training data within which the training data is to be subdivided into.

In some embodiments, a method for generating a natural language platform system configured to generate a natural language model is presented. The method includes: deriving from an analogous natural language platform system, parameters configured to optimize performance of said analogous system to generate an analogous natural language model configured to analyze similar but not identical documents as the natural language model; interpolating said parameters to be optimized for documents to be analyzed by the natural language model; and implementing the interpolated parameters in the natural language platform system such that the interpolated parameters are configured to generate the natural language model.

In some embodiments, another system for generating natural language models is presented. This system may include: a full service natural language platform geographically located at a remote host location and configured to: receive training data through a network connection; train a natural language model based on the received training data; and generate predictions about untested data using the natural language model; and a connector module geographically located at a local client host location and communicatively coupled to the full service natural language platform through the network connection and configured to: access the training data stored in a client data store at the local client host location; format the training data in a uniform manner; transmit the training data to the full service natural language platform through the network connection; receive the predictions about the untested data; and store the predictions about the untested data in a memory at the local client host location.
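As a rough illustration of the connector module's "format the training data in a uniform manner" step, the sketch below normalizes records from a client data store into one shape before transmission. The record fields (`id`, `source`, `text`), the whitespace normalization, and the JSON envelope are assumptions made for this example; the disclosure does not specify a wire format.

```python
import json


def format_record(doc_id, raw_text, source):
    """Normalize one client document into a uniform record (hypothetical schema)."""
    text = " ".join(raw_text.split())  # collapse runs of whitespace and newlines
    return {"id": str(doc_id), "source": source, "text": text}


def to_payload(records):
    """Serialize uniform records into a single JSON body for transmission."""
    return json.dumps({"documents": records}, ensure_ascii=False)


records = [
    format_record(7, "  Router is\n down ", "email"),
    format_record("t-19", "Phone bill\tis wrong", "ticket"),
]
payload = to_payload(records)
```

The transmission itself, and the receipt and local storage of predictions, are omitted here because they depend on the network connection and client memory that a given deployment provides.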

In some embodiments, the system further includes a text extraction module communicatively coupled to the connector module and configured to attach to the client data store and extract textual data for use as the training data.

In some embodiments, the full service natural language platform is further configured to: generate a hierarchical data structure, the hierarchical data structure comprising at least two topical nodes, wherein the at least two topical nodes represent partitions organized by two or more topical themes among the topical content of the training data within which the training data is to be subdivided into; select among the training data a plurality of documents to be annotated; and generate at least one annotation prompt for each document among the plurality of documents to be annotated, said annotation prompt configured to elicit an annotation about said document indicating which node among the at least two topical nodes of the hierarchical data structure said document is to be classified into. In some embodiments, the full service natural language platform is further configured to: cause display of the at least one annotation prompt for each document among the plurality of documents to be annotated; receive for each document among the plurality of documents to be annotated, the annotation in response to the displayed annotation prompt; and generate the natural language model using an adaptive machine learning process configured to determine, among the received annotations, patterns for how the documents in the training data are to be subdivided according to the at least two topical nodes of the hierarchical data structure.

In some embodiments, the full service natural language platform is further configured to: test performance of the natural language model using a subset of the documents among the training data that received annotations; compute a performance metric of the natural language model, based on results of the testing; determine that the natural language platform fails to satisfy the at least one performance criterion based on the computed performance metric; in response to said determining: identify a topical node among the two or more topical nodes of the hierarchical data structure that the natural language model fails to accurately categorize documents into; select a second plurality of documents to be annotated, the second plurality comprising documents associated with said topical node that the natural language model failed to accurately categorize documents into; generate a second set of at least one annotation prompt for each document among the second plurality of documents to be annotated, said annotation prompt among the second set configured to elicit an annotation about said document to improve the natural language model in accurately categorizing documents into said topical node; cause display of the second set of the at least one annotation prompt for each document among the second plurality of documents to be annotated; receive for each document among the second plurality of documents to be annotated, a second set of annotations in response to the second set of displayed annotation prompts; and generate a refined natural language model using the adaptive machine learning process and based on the hierarchical data structure, the training data and the second set of annotations.

BRIEF DESCRIPTION OF THE DRAWINGS

Some embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings.

FIG. 1A is a network diagram illustrating an example network environment suitable for aspects of the present disclosure, according to some example embodiments.

FIG. 1B is a diagram providing additional examples of networked environments for generating natural language models, according to some embodiments.

FIG. 2A is a diagram showing an example system architecture for performing aspects of the present disclosure, according to some example embodiments.

FIG. 2B is a diagram of an example flowchart for how the system architecture according to the diagram in FIG. 2A may be utilized to generate other natural language platforms that are more tailored to a client's specific needs, according to some embodiments.

FIG. 3 is a high level diagram showing various examples of types of human communications and what the objectives may be for a natural language model to accomplish, according to some embodiments.

FIG. 4 is a diagram showing an example flowchart for how different data structures within the system architecture may be related to one another, according to some example embodiments.

FIG. 5 is a flowchart of a high-level process for generating a natural language model, according to some embodiments.

FIG. 6 is a diagram of a more detailed view of the block 505 of FIG. 5, describing the ingest data phase, according to some embodiments.

FIG. 7 shows a simple example of an ontology.

FIG. 8 shows a more complex example of an ontology.

FIGS. 9A and 9B show an even more complex example of an ontology.

FIGS. 10A-10C provide additional example details for generating the ontology within the block 510 of FIG. 5, according to some embodiments.

FIGS. 11A and 11B show example displays for annotating as part of the adaptive machine learning process, according to some embodiments.

FIG. 12 provides a more detailed process flow of the block 515 of FIG. 5, for conducting the adaptive machine learning process using the annotations to generate the natural language model, according to some embodiments.

FIG. 13 provides further details of different examples of optimization techniques that may be applied in block 1240 of FIG. 12 for performing optimization techniques to improve the natural language model, according to some embodiments.

FIG. 14 illustrates a modified process for generating natural language models, including iterating processes between some processes within the create ontology block and the conduct adaptive machine learning block of FIG. 5, according to some embodiments.

FIG. 15 provides an additional variant for generating natural language models, this time including a maintenance process for updating currently operating natural language models, according to some embodiments.

FIG. 16 is a block diagram illustrating components of a machine, according to some example embodiments, able to read instructions from a machine-readable medium and perform any one or more of the methodologies discussed herein.

DETAILED DESCRIPTION

Example methods, apparatuses, and systems (e.g., machines) are presented for generating natural language models.

The modes of human communications brought about by digital technologies have created a deluge of information that can be difficult for human readers to handle alone. Companies and research groups may want to determine trends in the human communications to determine what people generally care about for any particular topic, whether it be what car features are being most expressed on Twitter®, what political topics are being most expressed on Facebook®, what people are saying about the customer's latest product in their customer feedback page, what are the key categories written about in a large body of legal documents, and so forth. In some cases, the companies or the research groups may want to determine what are the general topics being talked about, to begin with. It may be desirable for companies to aggregate and then synthesize the thousands or even millions of human communications from the many different modes available in the digital age (e.g., Twitter®, transcribed speech, email, OCR-ed text, etc.) to determine these objectives. Processing all this information by humans alone can be overwhelming and cost-inefficient. Methods today may therefore rely on computers to apply natural language processing in order to analyze the many human communications available in order to filter, organize, interpret, categorize, ask questions, respond to questions, and extract information from the many human communications into digestible patterns of communication.

Conventionally, natural language processing may involve humans declaring a large number of rules for the natural language model to follow. For example, when trying to determine how many communications convey negative sentiment about a politician, a rule may be generated to search for “horrible” and “Bill Clinton” in the same document. Similarly, a rule looking for the terms “wonderful” and “Bill Clinton” may attempt to find all the communications conveying positive sentiment about a politician. Conventionally, the natural language model may then parse a given number of documents, follow the rules, and attempt to categorize the documents by the rules. However, it can become clear that the number of rules necessary to capture all forms of negative and positive sentiment may be immeasurable, as there are dozens of synonyms for the words “bad” and “good” alone, not to mention all the slang terms, and sentiment conveyed by a combination of words that may not be captured by any singular rule. Generally, performing natural language processing through a rules-only approach oftentimes fails to adequately and reliably categorize human communications in a meaningful way.
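The limitation of the rules-only approach can be seen in a small sketch. The helper below is a hypothetical, deliberately naive co-occurrence rule of the kind described above (all required terms must appear in the same document); it is an illustration and not part of the disclosed system.

```python
def matches_rule(document, required_terms):
    """Return True when every required term occurs in the document (case-insensitive)."""
    text = document.lower()
    return all(term.lower() in text for term in required_terms)


negative_rule = ["horrible", "Bill Clinton"]

# Caught: the exact rule terms co-occur.
caught = matches_rule("Bill Clinton gave a horrible speech", negative_rule)

# Missed: the same sentiment expressed with a synonym the rule does not list.
missed = matches_rule("Bill Clinton gave a dreadful speech", negative_rule)
```

Covering the second document would require another rule for “dreadful,” and then another for each further synonym, slang term, or word combination, which is the scaling problem the paragraph describes.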

Alternately, natural language processing may involve applying machine-learning techniques to a set of example documents that have been categorized by human annotators, generating statistical inferences for the natural language model to follow. For example, a collection of documents written about “Bill Clinton” may include a subset of those documents understood and categorized by a human as conveying negative sentiment, and a second subset of documents understood and categorized as conveying positive sentiment. In this way, the natural language models generated through machine learning approaches are able to recognize many more characteristics of sentiment-conveying text than can be reasonably implemented through rules-only approaches. Accordingly, the accuracy of such models at categorizing human communications may be substantially higher than rules-based natural language processing approaches. However, conventionally, the use of machine learning for performing natural language processing requires human annotators to label hundreds or even thousands of example documents before the natural language models achieve a satisfactory level of accuracy. Therefore, it is desirable to use more effective natural language processing techniques to rapidly develop natural language models that efficiently and accurately summarize the thousands or millions of human communications.

Aspects of the present disclosure are presented for generating natural language models (also referred to herein as prediction engines) through a combination of processing human annotations of documents and adaptive machine learning techniques that determine patterns in the documents, based in part on the human annotations applied to the documents. The natural language platform used to generate the models, such as a software system stored in memory of one or more servers communicatively coupled in parallel, may be configured to dynamically generate human readable annotation prompts that may be expressed in varying levels of detail and granularity so as to efficiently obtain specific annotation information and maximize use of a human annotator's time. An iterative or cyclic process, involving cycling between supplying prompts for human annotations to documents and obtaining the annotations, then generating the natural language model using machine learning of the documents with the annotations, then evaluating the model and iterating back to obtaining more human annotations to refine the model, and so forth, is also presented.

As an example, the human annotations may assist the machine learning techniques to resolve inevitable ambiguities in the human communications, as well as provide intelligence or meaning to communications that the machine does not accurately comprehend a priori. The human annotations can then enable a natural language platform configured to generate these natural language models to provide better natural language results of the human communications, which can then in turn be better refined with the assistance of more human annotations as necessary. For example, a human annotator may provide an annotation about a Tweet reading “Can you believe this guy?!” as expressing negative sentiment of the person being referenced, while a rule may fail to flag this Tweet as having negative sentiment because no typical negative words are contained in the Tweet.

In some embodiments, the natural language platform used to generate the natural language models may be configured to generate human readable annotation prompts at varying degrees of detail, so as to elicit responses from human annotators that match more specifically a level of ambiguity the natural language platform is trying to resolve. For example, a first level of annotation prompts may include determining what subject category (or categories) a document may be categorized into, among a plurality of first categories. This first level may therefore generate a prompt to the human annotator that includes a multiple choice question including three or more categories that the human annotator may choose from, where the human annotator may be prompted to select among any or all of the categories. As another example, a second level of annotation prompts may include determining what subject category a document may be best categorized into, among a plurality of second categories. This second level may therefore generate a prompt displayable to the human annotator that includes a multiple-choice question including three or more categories that the human annotator may choose from, where the human annotator may be prompted to select only one category that best describes the document. As another example, a third level of annotation prompts may include supplying a binary decision to the human annotator, such as asking whether it fits into a category or not (e.g., a yes or no or a true or false question).

Depending on the amount and precision of information that the natural language model has already obtained, the natural language platform may dynamically adjust the type of annotation prompts to supply to human annotators. For example, if the natural language model possesses little information about a document or a category of documents, the natural language platform may be configured to supply a more open ended question to human annotators, such as an annotation prompt at the first level. On the other hand, if the natural language model possesses precise information about a document or a category of documents, the natural language platform may be configured to supply a more precise question to the human annotators, such as an annotation prompt at the third level. In this way, information to improve the natural language model may be more efficiently obtained and may maximize the use of human annotators' time.
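The three prompt levels and the dynamic choice among them might be sketched as follows. This example assumes that "amount and precision of information" can be approximated by the model's highest per-category confidence score; the thresholds (0.4 and 0.8), the class names, and the prompt wording are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class PromptLevel(Enum):
    MULTI_SELECT = 1   # first level: select any or all categories
    SINGLE_SELECT = 2  # second level: select the one best category
    BINARY = 3         # third level: yes/no about a single category


@dataclass
class AnnotationPrompt:
    level: PromptLevel
    question: str
    choices: list


def choose_level(top_confidence, low=0.4, high=0.8):
    """Less model certainty leads to a more open-ended prompt (assumed thresholds)."""
    if top_confidence < low:
        return PromptLevel.MULTI_SELECT
    if top_confidence < high:
        return PromptLevel.SINGLE_SELECT
    return PromptLevel.BINARY


def build_prompt(scores):
    """Build a prompt from a dict mapping category name to model confidence."""
    best = max(scores, key=scores.get)
    level = choose_level(scores[best])
    if level is PromptLevel.BINARY:
        return AnnotationPrompt(
            level, f"Does this document belong in '{best}'?", ["yes", "no"]
        )
    if level is PromptLevel.SINGLE_SELECT:
        return AnnotationPrompt(
            level, "Which one category best describes this document?", sorted(scores)
        )
    return AnnotationPrompt(
        level, "Which categories apply to this document? Select any or all.", sorted(scores)
    )


confident = build_prompt({"Internet": 0.9, "phone": 0.05, "billing": 0.05})
uncertain = build_prompt({"Internet": 0.35, "phone": 0.33, "billing": 0.32})
```

Here `confident` is a third-level yes/no prompt about “Internet,” while `uncertain` is a first-level prompt listing all three categories, so the annotator's effort goes where the model knows least.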

In addition, aspects of the present disclosure may construct natural language models based on this iterative process that can be specifically tailored to a customer's unique needs or subject matter area. For example, the words important to categorizing communications in biotechnology may be different than the words important to categorizing communications in the automobile industry. The biotechnology user may desire to tailor a natural language model to better understand articles related to biotechnology, while the automobile industry user may desire to tailor the same or a different natural language model to better understand customer feedback emails. As another example, the language, grammar, and idioms used in social media may vary drastically from communications in professional writings, e.g., legal or medical journals. A user focusing on Twitter® communications may desire to tailor the natural language model to better determine when tweets of adolescent teens convey positive sentiment or negative sentiment, while a user focusing on legal documents may desire to tailor a natural language model to better understand whether a legal decision is favorable or unfavorable, and to see how many scholarly articles are written about the legal decision and what percentage of them are critical of the legal decision. Aspects of the present disclosure therefore are robust enough to generate a natural language model to interpret documents in any language (or languages) and for any number of topics or subject areas contained within the documents.

The natural language generation platform may be configured to process the human annotations to any of these areas in order to better inform the natural language model how to understand these complex communications. The natural language model can be trained through the iterative process utilizing human annotations to more easily determine how to categorize any of these diverse areas of human communications.

In some embodiments, the various categories or topics, sometimes referred to herein as labels, that the documents are intended to be grouped into may be organized by a hierarchical data structure, referred to herein as an ontology. The hierarchical structure design of the ontology may define a plurality of subcategories, or sub labels, that thematically or topically fit within a larger umbrella label. For example, a set of documents about customer service emails of a telecommunications company may be categorized into an ontology where a first label of the ontology may be simply “customer service.” Under this first label, multiple sub labels may be included that thematically fall within the broader label of customer service, such as “Internet,” “phone,” and “customer service experience.” There may be further sub labels, such as “Check the router power,” which could be specific responses to questions in the customer service exchange, representing a single intention type of the person emailing the customer service representative: in this example, the intention to get solutions for a broken internet connection. In general, an ontology may include multiple levels of sub labels under higher levels of sub labels, and so forth. Similarly, an ontology may include multiple labels which define a plurality of common subcategories that thematically or topically fit within each larger umbrella. In its totality, the ontology can represent the multiple labels, extracted information, and sequences of responses to questions for an entire business unit and organization.
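The customer service example above can be expressed as a small tree. The sketch below is one possible in-memory representation; the `Label` class and its methods are hypothetical, and placing “Check the router power” under “Internet” is an assumption based on the stated intention of fixing a broken internet connection.

```python
from dataclasses import dataclass, field


@dataclass
class Label:
    """One node of the ontology: a label with zero or more sub labels."""
    name: str
    children: list = field(default_factory=list)

    def add(self, name):
        """Attach a new sub label and return it so deeper levels can be chained."""
        child = Label(name)
        self.children.append(child)
        return child

    def paths(self, prefix=()):
        """Yield the path from the root to every label, depth first."""
        here = prefix + (self.name,)
        yield here
        for child in self.children:
            yield from child.paths(here)


root = Label("customer service")
internet = root.add("Internet")
root.add("phone")
root.add("customer service experience")
internet.add("Check the router power")  # assumed placement under "Internet"

all_paths = list(root.paths())
```

Each tuple in `all_paths` names a label together with its ancestors, for example `("customer service", "Internet", "Check the router power")`, which is the kind of path a document could be classified into.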

In some embodiments, these labels of the ontology may define and providestructure for the categorization of millions of documents by the naturallanguage model. That is, developing the natural language model willinclude a goal of training the natural language model to accuratelyclassify each of the documents in a set of documents into one or more ofthe labels in the hierarchical data structure. In some embodiments, aprocess for generating the natural language model includes a subprocessfor first generating this ontology. Once generated, the ontology may beused to guide the aforementioned iterative process of obtaining humanannotations and applying them in adaptive machine learning processes togenerate a natural language model.

Once tuned to a user's specific needs through the aforementionediterative process, the natural language model can then be applied tountested human communications, and may generate an output thatsummarizes which human communications and how many are categorized intothe various labels in the ontology structure (e.g., positive sentimentof a politician, negative sentiment of a politician). The processconducted through the natural language model for classifying thethousands or even millions of documents into the specified topics (alsoreferred to herein as labels) may be referred to herein as adocument-scope task.

In some embodiments, a natural language model of the present disclosure may also be configured to extract specific subsets of text from documents (e.g., extract every phrase mentioning an anticipated cost or savings in a document discussing tax reform, or extract all words naming the crew in one or more documents relating to the latest blockbuster movie). This process conducted through the natural language model for extracting specific types of text from one or more documents may be referred to herein as a span-scope task. In general, as used herein, a span may refer to a subset of words or characters in a document, such as a paragraph, multiple paragraphs, sentences, or a plurality of words. Performing a span-scope task may also include first generating an ontology, and then conducting the iterative human annotation and adaptive machine learning process, in some embodiments.
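To make the notion of a span concrete, the sketch below returns character offsets and text for each extracted span. It is an assumption-laden illustration: a regular expression for dollar amounts stands in for the trained natural language model, and the names `COST_PATTERN` and `extract_spans` are hypothetical.

```python
import re
from typing import List, Tuple

# Stand-in for a trained model: matches dollar amounts such as "$1.2 billion".
COST_PATTERN = re.compile(
    r"\$\d[\d,]*(?:\.\d+)?(?:\s(?:million|billion|trillion))?"
)


def extract_spans(document: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, text) for every matching span in the document."""
    return [(m.start(), m.end(), m.group()) for m in COST_PATTERN.finditer(document)]


text = "The reform is expected to cost $1.2 billion and save households $300 a year."
spans = extract_spans(text)
```

A span here is identified by its start and end offsets, so the same representation could cover a word, a sentence, or several paragraphs.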

In some embodiments, a natural language platform of the present disclosure may also be configured to simply identify or discover what categories or subject areas (e.g., labels) the documents may be thematically organized into. For example, it may not be known which twenty subject areas were most discussed in all tweets within the first week of September 2015. Before a natural language model can be generated to summarize what sentiment is being conveyed about these top twenty most discussed subject areas, it should be known what kinds of subject matters these tweets generally discuss in the first place. The process for discovering what topics or subject areas the documents may be thematically organized into may be referred to as topic modeling.

In some embodiments, a natural language platform of the present disclosure may also be configured to utilize rules in combination with the iterative human annotation/machine learning process. In some embodiments, the rules may be imported from an existing body specified by a user or customer. In addition, the natural language model may be generated with the aid of any pre-existing conditions specified by the user, such as specified topics or rules, and embodiments are not so limited.

In some embodiments, a natural language model may be generated usingpreviously generated natural language models that were tailored todifferent but similar subject areas. In addition, the natural languagemodel may be refined or updated through a maintenance process so thatthe model may be better suited to handle new data with an evolving bodyof words. For example, a natural language model for processing tweetsmay be refined over time to account for new slang or idioms that werenot previously used in the common vernacular.

Examples merely demonstrate possible variations. Unless explicitlystated otherwise, components and functions are optional and may becombined or subdivided, and operations may vary in sequence or becombined or subdivided. In the following description, for purposes ofexplanation, numerous specific details are set forth to provide athorough understanding of example embodiments. It will be evident to oneskilled in the art, however, that the present subject matter may bepracticed without these specific details.

Referring to FIG. 1A, a network diagram illustrating an example network environment 100 suitable for performing aspects of the present disclosure is shown, according to some example embodiments. The example network environment 100 includes a server machine 110, a database 115, a first device 120 for a first user 122, and a second device 130 for a second user 132, all communicatively coupled to each other via a network 190. The server machine 110 may form all or part of a network-based system 105 (e.g., a cloud-based server system configured to provide one or more services to the first and second devices 120 and 130). The server machine 110, the first device 120, and the second device 130 may each be implemented in a computer system, in whole or in part, as described below with respect to FIG. 16. The network-based system 105 may be an example of a natural language platform configured to generate natural language models as described herein. The server machine 110 and the database 115 may be components of the natural language platform configured to perform these functions. While the server machine 110 is represented as just a single machine and the database 115 is represented as just a single database, in some embodiments, multiple server machines and multiple databases communicatively coupled in parallel or in serial may be utilized, and embodiments are not so limited.

Also shown in FIG. 1A are a first user 122 and a second user 132. One orboth of the first and second users 122 and 132 may be a human user, amachine user (e.g., a computer configured by a software program tointeract with the first device 120), or any suitable combination thereof(e.g., a human assisted by a machine or a machine supervised by ahuman). The first user 122 may be associated with the first device 120and may be a user of the first device 120. For example, the first device120 may be a desktop computer, a vehicle computer, a tablet computer, anavigational device, a portable media device, a smartphone, or awearable device (e.g., a smart watch or smart glasses) belonging to thefirst user 122. Likewise, the second user 132 may be associated with thesecond device 130. As an example, the second device 130 may be a desktopcomputer, a vehicle computer, a tablet computer, a navigational device,a portable media device, a smartphone, or a wearable device (e.g., asmart watch or smart glasses) belonging to the second user 132. Thefirst user 122 and a second user 132 may be examples of users orcustomers interfacing with the network-based system 105 to utilize anatural language model according to their specific needs. In othercases, the users 122 and 132 may be examples of annotators who aresupplying annotations to documents to be used for training purposes whendeveloping a natural language model. In other cases, the users 122 and132 may be examples of analysts who are providing inputs to the naturallanguage platform to more efficiently train the natural language model.The users 122 and 132 may interface with the network-based system 105through the devices 120 and 130, respectively.

Any of the machines, databases 115, or first or second devices 120 or130 shown in FIG. 1A may be implemented in a general-purpose computermodified (e.g., configured or programmed) by software (e.g., one or moresoftware modules) to be a special-purpose computer to perform one ormore of the functions described herein for that machine, database 115,or first or second device 120 or 130. For example, a computer systemable to implement any one or more of the methodologies described hereinis discussed below with respect to FIG. 16. As used herein, a “database”may refer to a data storage resource and may store data structured as atext file, a table, a spreadsheet, a relational database (e.g., anobject-relational database), a triple store, a hierarchical data store,any other suitable means for organizing and storing data or any suitablecombination thereof. Moreover, any two or more of the machines,databases, or devices illustrated in FIG. 1A may be combined into asingle machine, and the functions described herein for any singlemachine, database, or device may be subdivided among multiple machines,databases, or devices.

The network 190 may be any network that enables communication between oramong machines, databases 115, and devices (e.g., the server machine 110and the first device 120). Accordingly, the network 190 may be a wirednetwork, a wireless network (e.g., a mobile or cellular network), or anysuitable combination thereof. The network 190 may include one or moreportions that constitute a private network, a public network (e.g., theInternet), or any suitable combination thereof. Accordingly, the network190 may include, for example, one or more portions that incorporate alocal area network (LAN), a wide area network (WAN), the Internet, amobile telephone network (e.g., a cellular network), a wired telephonenetwork (e.g., a plain old telephone system (POTS) network), a wirelessdata network (e.g., WiFi network or WiMax network), or any suitablecombination thereof. Any one or more portions of the network 190 maycommunicate information via a transmission medium. As used herein,“transmission medium” may refer to any intangible (e.g., transitory)medium that is capable of communicating (e.g., transmitting)instructions for execution by a machine (e.g., by one or more processorsof such a machine), and can include digital or analog communicationsignals or other intangible media to facilitate communication of suchsoftware.

Referring to FIG. 1B, diagram 150 provides additional examples of networked environments for generating natural language models, according to some embodiments. Diagram 150 provides a block diagram view of how various configurations of a natural language platform configured to generate natural language models may interact with various data structures of a client, such as a user 120 or 130. In diagram 150, three options are illustrated. According to any of the three options, a flow of data transfer for generating natural language models begins at a client data store 155. The client data store 155 may contain a repository of data that is used to train and ultimately generate the natural language model. In some embodiments, a text extraction module 157 may be installed or communicatively coupled to the client data store 155 and may be configured to extract textual data. For example, the client data store 155 may contain scanned copies of articles or printed text. The text extraction module 157 may be configured to convert the scanned copies into computer readable text. The textual data may be transmitted to a connector 160, which may include software and hardware configured to act as a bridge that interfaces between the client data store 155 and one or more natural language platforms.

According to Option #1, from the connector 160, the textual data may betransmitted to a full-service natural language platform 170 through thenetwork 190. The full-service natural language platform 170 may belocated geographically distinct from the client, such as in a cloudenvironment. The full-service natural language platform 170 may beconfigured to provide all available services for generating naturallanguage models described herein, the various services of which will bedescribed more fully below. At the least, the full-service naturallanguage platform 170 may be configured to generate a natural languagemodel for the client by training the model on the textual data derivedfrom the client data store 155. As shown, this process may occur in thecloud environment. In addition, the full-service natural languageplatform 170 may also be configured to apply the trained naturallanguage model on untested, “live” data, referred to herein as providingpredictions on the untrained and untested data. The untrained oruntested data may also be obtained from the client, such as from theclient data store 155. As shown, this process also may occur in thecloud environment. The results of these predictions may be transmittedback to the client, through the network 190, bridged by the connector160, and ultimately stored in a results repository 185.

According to Option #2, from the connector 160, the textual data may be transmitted to a client-controlled natural language platform 175. The client-controlled platform 175 may be stored locally and within the exclusive control of the client, according to some embodiments. In this option, the client-controlled platform 175 may be configured to deploy a pre-generated natural language model, meaning the natural language model was already trained with previous data. The arrow from the connector 160 to the client-controlled platform 175 lists only “prediction,” meaning that the ability to train a natural language model is not available in this option. However, as shown, the client is able to utilize the pre-generated natural language model to make predictions on the untrained and untested data. These results are supplied to the connector 160, and ultimately stored in the results repository 185. According to this option, the client may have exclusive control of the natural language model, and is not required to release its data to an off-site platform in order to utilize any natural language model. Thus, Option #2 allows for greater privacy for the client.

According to Option #3, from the connector 160, the textual data may be transmitted to a client-controlled natural language platform 180 that provides full services. In this option, the full-service natural language platform 170 may be duplicated locally at the client site. This option may be more costly to the client, but may be more comprehensive. The services provided through the cloud 190 at the full-service natural language platform 170 may be substantially identical to those provided at the local full-service natural language platform 180, according to some embodiments.

Referring to FIG. 2A, a diagram 200 is presented showing an example system architecture for performing aspects of the present disclosure, according to some example embodiments. The example system architecture according to diagram 200 represents various data structures and their interrelationships that may comprise a natural language platform, such as the natural language platform 170, or the network-based system 105. These various data structures may be implemented through a combination of hardware and software, the details of which may be apparent to those with skill in the art based on the descriptions of the various data structures described herein. For example, an API module 205 includes one or more API processors, where multiple API processors may be connected in parallel. In some example embodiments, the repeating boxes in the diagram 200 represent identical servers or machines, to signify that the system architecture in diagram 200 may be scalable to an arbitrary degree. The API module 205 may represent a point of contact for multiple other modules, including a database module 210, a cache module 215, background processes module 220, applications module 225, and even an interface for users 235 in some example embodiments. The API module 205 may be configured to receive or access data from database module 210. The data may include digital forms of thousands or millions of human communications. The cache module 215 may store in more accessible memory various information from the database module 210 or from users 235 or other subscribers. Because the database module 210 and cache module 215 show accessibility through API module 205, the API module 205 can also support authentication and authorization of the data in these modules. The background module 220 may be configured to perform a number of background processes for aiding natural language processing functionality. Various examples of the background processes include a model training module, a cross validation module, an intelligent queuing module, a model prediction module, a topic modeling module, an annotation aggregation module, and an annotation validation module. These various modules are described in more detail below as well as in non-provisional applications (Attorney Docket Nos. 1402805.00012_IDB012, 1402805.00013_IDB013, 1402805.00014_IDB014, 1402805.00016_IDB016, 1402805.00017_IDB017, and 1402805.00019_IDB019), each of which is again incorporated by reference in its entirety. The API module 205 may also be configured to support display and functionality of one or more applications in applications module 225.

In some example embodiments, the users 235 may access the API module 205, in some cases enabling the users 235 to create their own applications using the system architecture of diagram 200. The users 235 may be other examples of the users 120 or 130, and may also include project managers and analysts. Project managers may utilize the natural language platform to direct the overall construction of one or more natural language models. Analysts may utilize the natural language platform to provide expert analysis and annotations to more efficiently train a natural language model. Also, annotators 230 may have access to applications already created in applications module 225. Further discussion about the annotators is found below.

In some embodiments, the system architecture according to diagram 200may be scalable and reproducible at various client sites, in some casesconsistent with the descriptions of FIG. 1B. Thus, the database modules210 and the cache module 215 may be implemented specifically for eachclient, such that each client does not share memory capacity withanother client to ensure better privacy.

Referring to FIG. 2B, diagram 250 provides an example flowchart for howthe system architecture according to diagram 200 may be utilized togenerate other natural language platforms that are more tailored to aclient's specific needs, according to some embodiments. For example, apre-existing natural language platform according to the diagram 200 mayhave already been constructed at the client site. The various APImodules 205, database modules 210, cache modules 215, background modules220, and display modules 225 of the pre-existing natural languageplatform may have been constructed and tailored to specific client needsand/or specific subject matter areas. For example, the pre-existingnatural language platform may have been constructed to generate one ormore natural language models related to comments on Twitter aboutbaseball. When generating a new natural language platform for similarsubject matter, it may be desirable to start with the pre-existingnatural language platform. For example, it may be desirable to base anew natural language platform for a client that desires to process textfrom Twitter related to all professional sports on the pre-existingnatural language platform configured to generate natural language modelsabout baseball.

In this respect, still referring to diagram 250, in some embodiments, aninterpolation operation 255 may be applied to the pre-existing naturallanguage platform. The interpolation operation 255 may include copyingvarious modules, data and labels from the pre-existing natural languageplatform over to the new natural language platform. For example, thevarious background modules 220 may be calibrated to more efficientlyhandle text from Twitter, knowing the general size of tweets and havingvarious libraries for understanding contemporary vernacular in tweets.The cache sizes and other various settings in the cache modules 215 mayalso be attuned to efficiently handle tweets, as another example. Thesame settings may be applied to the new natural language platform as astarting point. At block 260, the new natural language platform may begenerated, and may include various adjustments from the pre-existingnatural language platform. As yet another example, the servercomputation resources assigned to the various background modules 220 maybe adjusted in relation to the API modules 205, so that more computationresources are available for intelligent queuing.

In some embodiments, the natural language platform may be implemented by partitioning each of the API modules 205, application modules 225, background processing modules 220, cache modules 215 and database modules 210 to be implemented within individual containers. In such embodiments, platform interpolation may be implemented by adjusting the proportion of processing resources (e.g., memory, network, storage and computation) assigned to each such container. Other components that may be interpolated from pre-existing and analogous platforms may include computational resources (CPU performance, memory) reserved for the background processing modules relative to the API module. If the interpolated system will be used primarily for prediction, rather than for topic modeling or re-training, then a greater share of CPU resources may be allocated to the API module. Another example includes selecting an optimized default text encoding for the language(s) used in the interpolated systems. For example, if the interpolated system will only process Western European characters (no emoji, Cyrillic, East Asian languages, etc.), then using ISO-8859-1 instead of UTF-8 can improve performance and reduce storage cost.
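The storage difference between the two encodings can be seen in a short sketch. ISO-8859-1 uses one byte per character, while UTF-8 uses two bytes for accented Western European characters; the helper name `fits_latin1` is hypothetical and is used here only to check whether a text can be stored in the single-byte encoding at all.

```python
def fits_latin1(text: str) -> bool:
    """Return True if every character can be encoded in ISO-8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


sample = "Café déjà vu, señor"
latin1_bytes = sample.encode("iso-8859-1")  # one byte per character
utf8_bytes = sample.encode("utf-8")         # two bytes for each accented character

size_latin1 = len(latin1_bytes)
size_utf8 = len(utf8_bytes)
emoji_ok = fits_latin1("great game 👍")
```

For text that is mostly ASCII the saving is small, since UTF-8 also uses one byte for ASCII characters, and a check like `fits_latin1` would be needed before choosing the single-byte encoding as a default.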

Referring to FIG. 3, a high level diagram 300 is presented showingvarious examples of types of human communications and what theobjectives may be for a natural language model to accomplish. That is,while FIGS. 1A-2B have described the various structures of a naturallanguage platform configured to generate one or more natural languagemodels, FIG. 3 and the following figures will now discuss the variousexample functional objectives to be accomplished and various exampletechnical processes used to achieve these functional objectives. Here,various sources of data, sometimes referred to as a collection ofdocuments 305, may be obtained and stored in, for example database 115,client data store 155, or database modules 210, and may representdifferent types of human communications, all capable of being analyzedby a natural language model. Examples of the types of documents 305include, but are not limited to, posts in social media, emails or otherwritings for customer feedback, pieces of or whole journalisticarticles, commands spoken or written to electronic devices, and piecesof or whole scholarly texts. Other types include transcribed call centerrecordings, electronic (instant) messages, corporate communications(e.g., SEC 10-k, 10-q), confidential documents and communications storedon internal collaboration systems (e.g., SharePoint, Notes), andquestions and answers, either in isolation or as part of longerdialogues.

In some embodiments, at block 310, it may be desired to classify any ofthe documents 305 into a number of enumerated categories or topics,consistent with some of the descriptions mentioned above. This may bereferred to as performing a document-scope task. For example, a user 130in telecommunications may supply thousands of customer service emailsrelated to services provided by a telecommunications company. The user130 may desire to have a natural language model generated thatclassifies the emails into predetermined categories, such as negativesentiment about their Internet service, positive sentiment about theirInternet service, negative sentiment about their cable service, andpositive sentiment about their cable service. As previously mentioned,these various categories for which a natural language model may classifythe emails into, e.g. “negative” sentiment about “Internet service,”“positive” sentiment about “Internet service,” “negative” sentimentabout “cable service,” etc., may be referred to as “labels.” Based onthese objectives, at block 315, a natural language model may begenerated that is tailored to classify these types of emails into thesetypes of labels.

As another example, in some embodiments, at block 320, it may be desiredto extract specific subsets of text from documents, consistent with someof the descriptions mentioned above. This may be another example ofperforming a span-scope task, in reference to the fact that thisfunction focuses on a subset within each document (as previouslymentioned, referred to herein as a “span”). For example, a user 130 maydesire to identify all instances of a keyword, key phrase, or generalsubject matter within a novel. Certainly, this span scope task may beapplied to multiple novels or other documents. Here too, based on thisobjective, at block 315, a natural language model may be generated thatis tailored to perform this function for a specified number ofdocuments.

As another example, in some embodiments, at block 325, it may be desired to discover what categories the documents may be thematically or topically organized into in the first place, consistent with descriptions above about topic modeling. In some cases, the user 130 may utilize the natural language platform only to perform topic modeling and to discover what topics are most discussed in a specified collection of documents 305. To this end, the natural language platform may be configured to conduct topic modeling analysis at block 330. Topic modeling is discussed in more detail below, as well as in applications (Attorney Docket Nos. 1402805.00012_IDB012, 1402805.00013_IDB013, 1402805.00015_IDB015, and 1402805.00019_IDB019), each of which is again incorporated herein by reference in its entirety. In some cases, it may be desired to then generate a natural language model that categorizes the documents 305 into these newfound topics. Thus, after performing the topic modeling analysis 330, in some embodiments, the natural language model may also be generated at block 315.

Referring to FIG. 4, a diagram 400 is presented showing an example flowchart for how different data structures within the system architecture may be related to one another, according to some example embodiments. Here, the collections data structure 410 represents a set of documents 435 that in some cases may generally be homogeneous. A document 435 represents a human communication expressed in a single discrete package, such as a single tweet, a webpage, a chapter of a book, a command to a device, or a journal article, or any part thereof. Each collection 410 may have one or more tasks 430 associated with it. A task 430 may be thought of as a classification scheme. For example, a collection 410 of tweets may be classified by its sentiment, e.g. a positive sentiment or a negative sentiment, where each classification constitutes a task 430 about a collection 410. A label 445 refers to a specific prediction about a specific classification. For example, a label 445 may be the “positive sentiment” of a human communication, or the “negative sentiment” of a human communication. In some cases, labels 445 can be applied to merely portions of documents 435, such as paragraphs in an article or particular names or places mentioned in a document 435. For example, a label 445 may be a “positive opinion” expressed about a product mentioned in a human communication, or a “negative opinion” expressed about a product mentioned in a human communication. In some example embodiments, a task may be a sub-task of another task, allowing for a hierarchy or complex network of tasks. For example, if a task has a label of “positive opinion,” there might be sub-tasks for types of “positive opinions,” like “intention to purchase the product,” “positive review,” “recommendation to friend,” and so on, and there may be sub-tasks that capture other relevant information, such as “positive features.”
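The relationships among collections 410, documents 435, tasks 430, and labels 445 might be sketched as plain data classes. This is only one possible reading of FIG. 4; all class and field names below are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Label:
    """A specific prediction within a classification, e.g. "positive sentiment"."""
    name: str


@dataclass
class Task:
    """A classification scheme; may own labels and nested sub-tasks."""
    name: str
    labels: List[Label] = field(default_factory=list)
    subtasks: List["Task"] = field(default_factory=list)


@dataclass
class Document:
    """A single discrete human communication, such as one tweet."""
    text: str


@dataclass
class Collection:
    """A set of documents with the tasks associated with it."""
    name: str
    documents: List[Document] = field(default_factory=list)
    tasks: List[Task] = field(default_factory=list)


# A task whose label has finer-grained sub-tasks, as in the example above.
opinion = Task(
    "opinion",
    labels=[Label("positive opinion"), Label("negative opinion")],
    subtasks=[Task("positive review"), Task("recommendation to friend")],
)
tweets = Collection("tweets", documents=[Document("Love this phone!")], tasks=[opinion])
```

Nesting `Task` inside `Task` is what allows the hierarchy or network of tasks described in the paragraph above.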

Annotations 440 refer to classifications imputed onto a collection 410or a document 435 by an external source, such as human input,interpolation from an existing natural language model, or application ofan external transformation process on the metadata included with thedocument (e.g., customer value, geographic location, etc.). As anexample, an annotation 440 applies a label 445 manually to a document435. In other cases, annotations 440 are provided by users 235 frompre-existing data. In other cases, annotations 440 may be derived fromhuman critiques of one or more documents 435, where the computerdetermines what annotation 440 should be placed on a document 435 (orcollection 410) based on the human critique. In other cases, with enoughdata in a language model, annotations 440 of a collection 410 can bederived from one or more patterns of pre-existing annotations found inthe collection 410 or a similar collection 410.

In some example embodiments, features 450 refer to a library orcollection of certain key words or groups of words that may be used todetermine whether a task 430 should be associated with a collection 410or document 435. Thus, each task 430 has associated with it one or morefeatures 450 that help define the task 430. In some example embodiments,features 450 can also include a length of words or other linguisticdescriptions about the language structure of a document 435, in order todefine the task 430. For example, classifying a document 435 as being alegal document may be based on determining if the document 435 containsa threshold number of words with particularly long lengths, wordsbelonging to a pre-defined dictionary of legal-terms, or words that arerelated through syntactic structures and semantic relationships. In someexample embodiments, features 450 are defined by code, while in othercases features 450 are discovered by statistical methods. In someexample embodiments, features 450 are treated independently, while inother cases features 450 are networked combinations of simpler featuresthat are used in combination utilizing techniques like “deep-learning”or “topic modeling.” In some example embodiments, combinations of themethods described herein may be used to define the features 450, andembodiments are not so limited. One or more processors may be used toidentify in a document 435 the words found in features data structure450 to determine what task should be associated with the document 435.
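The legal document example above can be sketched as two simple features: a count of long words and a count of words from a pre-defined dictionary. The dictionary contents, the thresholds, and the function names are assumptions made for illustration; the disclosure does not specify them, and syntactic or semantic features are not covered here.

```python
import re
from typing import Dict

# Hypothetical dictionary of legal terms; a real one would be far larger.
LEGAL_TERMS = {"plaintiff", "defendant", "herein", "tort", "appellant"}


def legal_feature_counts(text: str, min_length: int = 12) -> Dict[str, int]:
    """Count long words and dictionary words in the text."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    return {
        "long_words": sum(len(w) >= min_length for w in words),
        "legal_terms": sum(w in LEGAL_TERMS for w in words),
    }


def looks_legal(text: str, long_threshold: int = 2, term_threshold: int = 2) -> bool:
    """Flag the text as legal if either feature count reaches its threshold."""
    counts = legal_feature_counts(text)
    return (counts["long_words"] >= long_threshold
            or counts["legal_terms"] >= term_threshold)


legal = looks_legal("The plaintiff alleges that the defendant breached the agreement herein.")
casual = looks_legal("See you at the game tonight")
```

In this sketch the features are defined by code; the paragraph above notes that features may instead be discovered by statistical methods.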

In some example embodiments, a work units data structure 455 specifies when humans should be tasked to further examine a document 435. Thus, human annotations may be applied to a document 435 after one or more work units 455 are applied to the document 435. The work units 455 may specify how many human annotators should examine the document 435 and in what order, relative to other documents, the document 435 should be examined. In some example embodiments, work units 455 may also determine what annotations should be reviewed in a particular document 435 and what the optimal user interface should be for review.

In some example embodiments, the data structures 405, 415, 420 and 425 represent data groupings related to user authentication and user access to data in the system architecture. For example, the subscribers block 405 may represent users and associated identification information about the users. The subscribers 405 may have associated API keys 415, which may represent one or more authentication data structures used to authenticate subscribers and provide access to the collections 410. Groups 420 may represent a grouping of subscribers based on one or more common traits, such as subscribers 405 belonging to the same company. Individual users 425 capable of accessing the collections 410 may also result from one or more groups 420. In addition, in some cases, each group 420, user 425, or subscriber 405 may have associated with it a more personalized or customized set of collections 410, documents 435, annotations 440, tasks 430, features 450, and labels 445, based on the specific needs of the customer.

Referring to FIG. 5, flowchart 500 provides a high-level process for generating a natural language model, according to some embodiments. The high-level process described in FIG. 5 may be implemented by a natural language platform, various examples of which have been described herein. Starting at block 505, a natural language platform may be configured to ingest training data representative of documents that will ultimately be analyzed by the natural language model once the natural language model is fully trained. Examples of training data may include exactly the types of documents that are intended to be analyzed by the natural language model, including any of the example collections of documents 305. In addition, subsets of documents may be used for training data, such as paragraphs, excerpts, chapters, or articles of larger bodies of text. The training data may be obtained from a number of different sources, including, for example, from databases controlled by the client like the client data store 155. In other cases, the training data may be obtained from public sources or from data sets used to derive previously trained natural language models. Example processes for how the natural language platform may be configured to ingest the training data are described in more detail below.

At block 510, the natural language platform may then be configured to create a hierarchical data structure used to organize the collection of documents in question, according to some embodiments. As previously mentioned, this hierarchical data structure may be referred to as an ontology. The ontology may include one or more labels that the documents are to be categorized into. The one or more labels may be organized into a hierarchical structure, based on a logical relationship between the labels. For example, if one label may be thought of as having a logical subset of one or more additional labels, then the one or more additional labels may be represented as a sub node under the one label. Examples of this hierarchical tree structure will be described in more detail below.

In some embodiments, the ontology may be created based on labels generated or discovered through the topic modeling process mentioned above. For example, if a client desires to generate a natural language model for performing the task of classification analysis of a random collection of tweets generated within the last two weeks of a calendar year, and the client wants to classify the collection of tweets into the top 10 most discussed topics, topic modeling may first be performed on the collection of tweets to discover precisely what these top 10 topics are. These top 10 topics may then be organized in a hierarchical structure of an ontology, and may include some of the topics as sub nodes under other topics, based on a logical or thematic relationship. For example, Christmas may be a frequently discussed topic within the last two weeks of the calendar year, and within the general topic of “Christmas,” there may be discussed “presents” and “Santa Claus.” Thus, the topics (also called labels) of “presents” and “Santa Claus” may be organized in the ontology as sub nodes of the “Christmas” label.

In some embodiments, the ontology may also be created based on direct inputs from the client, such as manual inputs for labels the client knows or intends should be included in the ontology, as well as previously generated ontologies to be imported by the client. In addition, the ontology may be created through an iterative verification process involving receiving annotations from expert annotators or other analysts. Additional details of these variants will be described in more detail below.

At block 515, after an ontology has been established, the natural language platform may be configured to conduct an adaptive machine learning process to generate the natural language model. This adaptive machine learning process may train the natural language model on how to categorize the training data into the various labels as structured in the ontology. In some embodiments, the natural language platform may be configured to conduct an iterative process including generating and receiving human annotations about the documents and then training the model based on these annotations. In some embodiments, the iterative process may be repeated based on a determination of whether the trained model meets various performance criteria. If the performance criteria are not satisfied, in some embodiments, the natural language platform may also be configured to conduct one or more optimization techniques before repeating the process of generating and receiving new human annotations and then training the model based on these annotations. Further details for conducting this adaptive machine learning process are described below.

At block 520, once the training process is complete and the natural language model has been sufficiently trained (e.g., the trained model satisfies one or more threshold performance criteria), then the natural language model may be utilized by the natural language platform to analyze untested data as specified by the user or client. As referred to in previous figures, such as FIG. 1B, this phase of the process may be referred to as predicting or generating predictions, and the utilization of the natural language model at this stage may be consistent with the discussion of predictions as described in FIG. 1B. An example output for analyzing the untested data using the natural language model may include organizing the documents into the various labels and sub labels as specified in the ontology structure, or tagging individual documents with all relevant labels and sub-labels from the ontology. In some embodiments, various graphics in one or more graphical user interfaces may also be included and may be configured to graphically depict how the documents are organized within the ontology structure. Examples of these displays are described in application (Attorney Docket No. 1402805.00013_IDB013), which is again incorporated herein by reference in its entirety.

Referring to FIG. 6, diagram 600 provides a more detailed view of the block 505 of FIG. 5, describing the ingest data phase, according to some embodiments. For example, at block 605, the natural language platform may be configured to ingest data related to one or more pre-existing ontologies. The data of the pre-existing ontologies may be used to help build the ontology for the present natural language model, while in other cases one of the pre-existing ontologies may be used exactly as the intended ontology for the present natural language model. These pre-existing ontologies may be obtained from the user or the client, for example, when the client has already performed some work in building the natural language model previously.

At block 610, the natural language platform may also be configured to ingest existing classification rules for use in training the natural language model, according to some embodiments. As previously mentioned, relying on numerous rules has been a classic way of training natural language models. While aspects of the present disclosure do not rely exclusively on utilizing rules, or even utilizing any rules at all, in some embodiments, the inclusion of at least some rules may still be helpful in training the natural language model. In some cases, the existing classification rules may be obtained from the user or the client, for example, when the client has already performed some work in building the natural language model previously. In some cases, the included rules may represent absolute conditions, such as always classifying a document when the document contains a specific word or phrase. In other cases, the included rules may represent probabilistic conditions that may be weighted against other probabilistic factors. For example, the rule may assign a weight to the document to be classified as a certain label, where said weight may be factored into an overall composite score including other weights based on other factors.
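As a minimal, non-limiting sketch, the distinction between absolute rules and probabilistic (weighted) rules may be expressed as follows. The rule representation, the additive composite score, the threshold value, and all names are assumptions made for illustration and are not specified by the disclosure.

```python
def apply_rules(text, absolute_rules, weighted_rules, other_scores, threshold=0.5):
    """Classify a document using absolute and weighted rules.

    absolute_rules: list of (phrase, label); a match always assigns the label.
    weighted_rules: list of (phrase, label, weight); a match adds weight to the label.
    other_scores:   dict of label -> score contributed by other factors (hypothetical).
    """
    lowered = text.lower()

    # Absolute conditions take effect immediately.
    for phrase, label in absolute_rules:
        if phrase in lowered:
            return label

    # Probabilistic conditions are folded into a composite score per label.
    scores = dict(other_scores)
    for phrase, label, weight in weighted_rules:
        if phrase in lowered:
            scores[label] = scores.get(label, 0.0) + weight

    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```

For example, with an absolute rule ("court order", "legal") and a weighted rule ("earnings", "finance", 0.3), a document mentioning earnings is labeled "finance" only if the weight together with the other factors reaches the threshold.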

At block 615, the natural language platform may also be configured to ingest test data in a specified data format. An example of ingesting the test data may include the example processes for obtaining data from the client data store 155 in FIG. 1B. Examples of a specified data format may include the test data being converted into all text, or the test data being converted into another uniform format. In some embodiments, the test data may be first ingested in its various formats, and then within the natural language platform, additional processes may convert the data into the uniform format. In other cases, the test data may be first converted into a uniform format, for example, at the client level, which is then ingested by the natural language platform.

At block 620, the natural language platform may also be configured to perform various normalization and filtering procedures on the test data. For example, the documents converted into all text may then be broken up into individual units of language, referred to as tokens. Examples of tokens can include single words, character sequences, punctuation marks, and spaces. As another example, related and semantically equivalent tokens may be converted into canonical forms. Examples of such canonical normalizations include decomposing character sequences containing diacriticals (e.g., decomposing “À” to “A” and a combining grave accent) and jamo, expanding abbreviations into the original word (e.g., expanding “Assn.” to “Association”), and replacing alternate variations of emoticons with a primary form (e.g., replacing “;-)” or “~_^” with “;)”). Another example includes deduplication of identical, near-identical (retweet/quoted reply), and thematically similar items.
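The normalization steps above may be sketched, purely as an illustration, with Python's standard `unicodedata` module. The lookup tables and function names are hypothetical, tokenization is simplified to whitespace and non-whitespace runs, and only exact duplicates (after normalization) are removed; near-identical and thematically similar items would require additional techniques not shown here.

```python
import re
import unicodedata

# Hypothetical lookup tables; a real deployment would use far larger ones.
ABBREVIATIONS = {"Assn.": "Association"}
EMOTICONS = {";-)": ";)"}


def normalize(text):
    """Tokenize, canonicalize tokens, and decompose accented characters."""
    # Tokens are runs of non-whitespace or runs of whitespace, so spaces survive.
    tokens = re.findall(r"\S+|\s+", text)
    canonical = []
    for token in tokens:
        token = ABBREVIATIONS.get(token, token)
        token = EMOTICONS.get(token, token)
        canonical.append(token)
    # NFD splits a precomposed character into a base character plus combining marks.
    return unicodedata.normalize("NFD", "".join(canonical))


def deduplicate(documents):
    """Keep the first document of each group that is identical after normalization."""
    seen = set()
    unique = []
    for document in documents:
        key = normalize(document)
        if key not in seen:
            seen.add(key)
            unique.append(document)
    return unique
```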

While the diagram 600 shows arrows progressing from one block to another, in some cases each of the individual blocks 605, 610, 615, and 620 may be performed in different orders, and in some cases not all of the example blocks 605, 610, 615, and 620 need be performed. In general, other processes for ingesting data to be used in generating a natural language model may be apparent to those with skill in the art, and embodiments are not so limited.

After the various types of data have been ingested at block 505, e.g., any or all of the example blocks 605, 610, 615, and 620 are performed, at block 625, the totality of the ingested data may be processed for use in the natural language platform for generating one or more ontologies and for conducting the adaptive machine learning process to generate one or more natural language models.

Having described example details of block 505 for ingesting data, FIGS. 7-10 next describe further details of block 510, including example details of various ontologies and an example detailed process for creating the ontology, according to some embodiments. Referring to FIG. 7, a simple example of an ontology is shown through the various charts 700, 710, and 720.

Chart 700 shows a spreadsheet of two tasks, with each task including multiple labels. For example, a “sentiment” task includes three labels, “positive,” “negative,” and “neutral.” As an example, a user may desire to utilize a natural language model to perform a task of classifying documents by the sentiment each document conveys. The various types of sentiment that the documents may be categorized into are enumerated by the labels under this sentiment task, i.e., “positive,” “negative,” and “neutral.” The set of documents may be a collection of tweets about a basketball game, for example. The views expressed in the tweets may therefore be classified as either having positive sentiment, negative sentiment, or neutral sentiment, and the user or client may desire to have a natural language model generated to evaluate these tweets and classify them into one of these various labels.

In another example, still referring to chart 700, a “relevant” task includes two labels, “relevant,” and “irrelevant.” Here, the user may desire to utilize a natural language model to perform a task of classifying documents according to their relevance, where the determination of relevance may be specified by the user or associated analysts or annotators. For example, the determination of relevance may be based on whether a tweet among the collection of tweets pertains to the basketball game. The collection of tweets may have been obtained over the approximate period of time that the basketball game was played, but since other events may be occurring at that time, there may be no guarantee that all of the tweets necessarily discuss the basketball game. Thus, the natural language model may be configured to evaluate these tweets and classify them as either relevant or irrelevant, according to the criteria used to train the model.

Still referring to chart 700, a column for “subtasks” is shown, which may represent connections from one task to another. Examples of subtasks will be discussed in the later figures. All told, these relationships according to the chart 700 comprise an example of a simple ontology. That is, a natural language model may perform the sentiment task and/or the relevant task to classify a collection of documents (or spans). The natural language model may perform these tasks by evaluating each document among the set of documents and classifying them into one or more of the labels under each task.

Referring to chart 710, the text of the chart 700 is expressed in a textual format. This example format may represent how the natural language model is configured to ingest the ontology so as to perform the proper analysis.

Referring to chart 720, the example ontology formed by the words in chart 700 is shown here being organized in a hierarchical tree structure. That is, the overarching tasks, i.e., “relevant,” and “sentiment,” may represent the main nodes of the hierarchical structure, as shown. The labels associated with the respective tasks, i.e., “positive,” “negative,” and “neutral” for the “sentiment” task, and “irrelevant” and “relevant” for the “relevant” task, are placed under their respective tasks as sub nodes.

Referring to FIG. 8, a more complex example of an ontology is described. In chart 800, the same tasks and associated labels for each task as in FIG. 7 are shown. In addition, a subtask of “sentiment” is listed and is associated with the label “relevant” under the task “relevant.” What this means, according to some embodiments, is that the task of “sentiment” and associated labels are to be embedded as sub nodes under the label “relevant.”

Referring to chart 810, the text of the chart 800 is expressed in a textual format. This example format may represent how the natural language model is configured to ingest the ontology so as to perform the proper analysis. Notice how the third word “sentiment” is placed after the label “relevant” and a delimiter (e.g., a comma). The natural language platform may be configured to understand that the third word or phrase (e.g., words or phrases after a second delimiter on the same line) represents a subtask, based on the organization defined in the first line of chart 810.
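As an illustrative sketch only, a delimited textual ontology of this kind might be parsed as shown below. The exact contents of chart 810 are not reproduced here; the comma delimiter, the header names "task", "label", and "subtask", and the function name are assumptions made for the example.

```python
import csv
import io

# Hypothetical textual ontology; the first line defines the column organization.
SAMPLE = """task,label,subtask
relevant,relevant,sentiment
relevant,irrelevant
sentiment,positive
sentiment,negative
sentiment,neutral
"""


def parse_ontology(text):
    """Return {task: {label: [subtasks]}} from a delimited textual ontology."""
    rows = csv.reader(io.StringIO(text.strip()))
    header = [name.strip() for name in next(rows)]
    ontology = {}
    for cells in rows:
        if not cells:
            continue
        # Map each cell to its column name; short rows simply lack a subtask.
        record = dict(zip(header, (cell.strip() for cell in cells)))
        subtask = record.get("subtask", "")
        labels = ontology.setdefault(record["task"], {})
        labels[record["label"]] = [subtask] if subtask else []
    return ontology
```

In this sketch, a value after the second delimiter on a line is recorded as a subtask of the label on that line, so "sentiment" becomes a subtask of the "relevant" label.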

Referring to chart 820, the example ontology formed by the words in chart 800 is expressed in a hierarchical tree structure according to the organization defined by the headers in chart 800. That is, because the task “sentiment” is listed as a subtask associated with the label “relevant,” the “sentiment” task and associated labels are placed as sub nodes under the “relevant” label, which is itself a sub node of the “relevant” task, as shown. In this way, when the natural language model analyzes various documents (or spans of documents) according to the structure of this ontology, the natural language model will perform the sentiment analysis task only if the document has first been classified as being “relevant” when performing the “relevant” analysis task. Put another way, if the natural language model analyzes a document (or a span) and determines that the document or span is to be classified as “irrelevant,” then the natural language model will discontinue any further analysis of the document or span, since the “irrelevant” label has no further sub nodes to travel down. In general, this example illustrates that the tree structure of the ontology can be used to guide the varying levels of analysis that a natural language model is to perform when analyzing documents.

Referring to FIGS. 9A and 9B, an even more complex example of an ontology is shown. Referring to chart 900, an additional task called “genre” is included in the ontology. The genre task is listed as a subtask under the “relevant” label. The genre task includes four labels, “mystery,” “comedy,” “action,” and “horror.” Also, the mystery, comedy, and action labels have “sentiment” as a subtask, while the horror label has no subtask listed. This example shows that a single task may be listed as a subtask under multiple labels, and that tasks can be listed as sub tasks in a cascading or hierarchical fashion (e.g., “sentiment” is a subtask of multiple labels in the “genre” task, which is itself a subtask of the “relevant” label of the “relevant” task).

Referring to chart 910, the text of the chart 900 is expressed in a textual format. This example format may represent how the natural language model is configured to ingest the ontology so as to perform the proper analysis.

Referring to chart 920, the example ontology formed by the words in chart 900 is expressed in a hierarchical tree structure according to the organization defined by the headers in chart 900. As shown, because the sentiment task is listed as a subtask for three of the labels in the genre task, the sentiment task and its associated labels are embedded as sub nodes under each of the three labels under the genre task, i.e., the “action,” “comedy,” and “mystery” labels. In addition, it can be seen that the entire genre task and associated labels (and subsequent subtasks) is embedded as a sub node under the relevant label, consistent with the organization described in chart 900.

Still referring to chart 920, a natural language model may conduct analysis of documents or spans of documents according to the hierarchical structure as shown. Following the hierarchical structure of the ontology, a natural language model may start by conducting “relevant” analysis of the document or span to determine if the document is “relevant” or “irrelevant.” If it is determined that the document is irrelevant, then the analysis ends. However, if the document is determined to be relevant, then the natural language model may proceed by conducting “genre” analysis, to determine whether the document should be classified as fitting into the “action” genre, the “comedy” genre, the “horror” genre, or the “mystery” genre. If classified as fitting into the horror genre, the analysis ends. However, if classified as any of the other three labels, the natural language model may continue by performing “sentiment” analysis, and classifying the document by its sentiment according to the sentiment labels “negative,” “neutral,” or “positive.”
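The traversal just described may be sketched as follows. This is a minimal illustration under stated assumptions: the nested-dictionary representation of the ontology, the function names, and the keyword-based stub classifiers are hypothetical stand-ins for trained per-task models.

```python
# Ontology of FIG. 9A as {task: {label: [subtasks]}} (representation is an assumption).
ONTOLOGY = {
    "relevant": {"relevant": ["genre"], "irrelevant": []},
    "genre": {
        "mystery": ["sentiment"],
        "comedy": ["sentiment"],
        "action": ["sentiment"],
        "horror": [],
    },
    "sentiment": {"positive": [], "negative": [], "neutral": []},
}

# Stub classifiers standing in for trained models; keyword tests are illustrative only.
STUB_CLASSIFIERS = {
    "relevant": lambda doc: "relevant" if "movie" in doc else "irrelevant",
    "genre": lambda doc: "horror" if "scary" in doc else "comedy",
    "sentiment": lambda doc: "positive" if "loved" in doc else "neutral",
}


def classify(document, classifiers, ontology=ONTOLOGY, root="relevant"):
    """Walk the ontology from the root task, following the subtasks of each chosen label."""
    path = []
    pending = [root]
    while pending:
        task = pending.pop(0)
        label = classifiers[task](document)
        if label not in ontology[task]:
            raise ValueError(f"{label!r} is not a label of task {task!r}")
        path.append((task, label))
        # A label with no subtasks ends this branch of the analysis.
        pending.extend(ontology[task][label])
    return path
```

With these stubs, a document classified as "irrelevant" or as "horror" stops early, while any other genre continues on to the sentiment task.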

Referring to chart 930 in FIG. 9B, the example ontology described herein is shown in a logical hierarchy format. Various tasks in the ontology are shown in rectangular boxes, while the labels are shown in ovals. The arrows show the progressive relationship for how a natural language model may proceed through the ontology when analyzing a document. Thus, the first level analysis performed is to analyze relevance. If deemed relevant, the next level of analysis performed is genre analysis. If the document is deemed to be categorized into one of the three labels other than “horror,” then sentiment analysis is performed last. This logical structure also illustrates how some ontologies may be structured in more complicated ways than a simple tree structure, as the analysis paths after “Mystery,” “Comedy” and “Action” all converge to continue performing sentiment analysis.

Now that examples of ontologies and their applications have been described, referring to FIG. 10A, flowchart 1000 provides additional example details for generating the ontology, according to some embodiments. The example processes of flowchart 1000 may represent further details in block 510 of the high level flow chart of FIG. 5 for generating natural language models. These example sub processes may begin at block 1005, which refers to ingesting training data and other types of data. Block 1005 may be consistent with block 505 and the related details described in flowchart 600 of FIG. 6, where the example outputs at block 625 may serve as the example inputs into flowchart 1000. Again, this data may include a collection of documents or spans of documents, one or more rules, and/or one or more pre-generated ontologies as defined by the client or user.

Having at least obtained a collection of documents intended for a natural language model to analyze, in some embodiments, the natural language platform may be configured to first perform a topic modeling process, at block 1010. This topic modeling process may be consistent with those descriptions of topic modeling described above, as well as examples for performing topic modeling as described in applications (Attorney Docket Nos. 1402805.00012_IDB012, 1402805.00013_IDB013, 1402805.00014_IDB014, 1402805.00016_IDB016, 1402805.00017_IDB017, and 1402805.00019_IDB019), again incorporated herein by reference. The topic modeling process may be used to determine one or more labels for the natural language model to categorize the collection of documents into, which may be organized into a hierarchical structure according to the processes described below. In some embodiments, topic modeling may not be performed, and instead the natural language platform may access pre-generated or predetermined sets of topics as defined by the client or user or previously obtained in the ingest data phase.

At block 1015, the natural language platform may be configured to conduct a rules generation process to confirm and refine rules that may help define how the ontology is to be structured. For example, the natural language platform may access a number of rules, where some may be obtained from the previously ingested data, while other rules may be programmed or otherwise specified by human analysts with expert knowledge of the subject matter for which the natural language model is to be generated. The natural language platform may receive these inputs about new rules or modifying pre-existing rules from the analysts through one or more graphical user interfaces, such as through the one or more application modules 225 as described in FIG. 2A. The rules generation process at block 1015 may then include verifying the enumerated rules for logical consistency. For example, the natural language platform may be configured to evaluate the totality of rules to see if any of the rules lead to infinite circular loops, or lead to definitions of labels or tasks that lack proper node connections sufficient to fit in the ontology. As another example, the rules may be evaluated to determine if two or more of the rules conflict with each other, or if two or more of the rules are redundant.
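One possible way to carry out such consistency checks is sketched below, assuming (hypothetically) that each structural rule is reduced to a (parent task, child task) pair. The depth-first search for circular loops, the reachability check for missing node connections, and the duplicate check for redundancy are illustrative choices and are not prescribed by the disclosure; detecting conflicting rules would depend on the rule semantics and is not shown.

```python
def find_cycle(edges):
    """Return one circular chain of tasks as a list, or None if there is none."""
    graph = {}
    for parent, child in edges:
        graph.setdefault(parent, []).append(child)
        graph.setdefault(child, [])
    unvisited, in_progress, done = 0, 1, 2
    state = {node: unvisited for node in graph}

    def visit(node, stack):
        state[node] = in_progress
        stack.append(node)
        for nxt in graph[node]:
            if state[nxt] == in_progress:
                # Reaching a task that is still on the stack closes a loop.
                return stack[stack.index(nxt):] + [nxt]
            if state[nxt] == unvisited:
                found = visit(nxt, stack)
                if found:
                    return found
        stack.pop()
        state[node] = done
        return None

    for node in graph:
        if state[node] == unvisited:
            found = visit(node, [])
            if found:
                return found
    return None


def find_disconnected(tasks, edges, root):
    """Return the tasks that cannot be reached from the root task."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
    reachable, frontier = {root}, [root]
    while frontier:
        for child in children.get(frontier.pop(), []):
            if child not in reachable:
                reachable.add(child)
                frontier.append(child)
    return sorted(set(tasks) - reachable)


def find_redundant(edges):
    """Return rules that appear more than once."""
    seen, duplicates = set(), []
    for edge in edges:
        if edge in seen and edge not in duplicates:
            duplicates.append(edge)
        seen.add(edge)
    return duplicates
```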

At block 1020, the natural language platform may be configured to generate a first version of the ontology based on the topics discovered or previously obtained and the verified rules in the previous steps. For example, the topics may define the content of the nodes for an ontology, while the rules define the relationships among the various topics. For example, a rule may specify to perform a genre analysis task of a document after determining whether the document is relevant. Thus, this rule may help define that the genre task should be a sub node under the relevant task. An example process for how rules contribute to generating an ontology includes the following steps:

Given a goal of performing Genre analysis over a collection of documents discussing movies, an analyst may:

Use topic modeling to automatically identify relationships and clusters between documents;

Manually review the topics, recognizing that several of the identified topics are discussing Horror movies. There may be multiple topics related to Horror movies, e.g., a topic about the Halloween series of films, identified by keywords Halloween, Myers, Curtis, Knife, and a topic about the Nightmare on Elm Street series, identified by keywords Nightmare, Krueger, Craven, and Dreams;

Manually create a classification (label) for Horror within the Genre task; and

Manually define rules that cause documents which include the words “Halloween” or “Nightmare” to be classified as Horror with greater confidence.

In some embodiments, it is possible to create an ontology without topic modeling or rule-writing, by creating a classification scheme and annotating it. This can be sufficient if the classification problem is well-defined and annotations can be crowd-sourced effectively. Sentiment analysis is an example of this.

After applying all of the rules to the various topics to create a hierarchical structure for the ontology, at block 1025, the natural language platform may be configured to test the effectiveness of the ontology, by first generating one or more annotation questions designed to prompt one or more annotators to evaluate the ontology. For example, an annotator might be asked “is there positive sentiment expressed about the car in this sentence?” or “is this statement an acceptable answer to this question?” These annotation prompts may be displayed in the one or more graphical user interfaces, such as through the application modules 225. In some embodiments, analysts may interact with these annotation prompts at the client site, while in other cases the analysts may be stationed locally near the natural language platform.

Referring to FIG. 10B, illustration 1027 shows an example annotation interface to collect input in the form of document annotations/classifications from human analysts and other experts, according to some embodiments. These annotations can be used to evaluate the effectiveness of the ontology and to train the natural language processing models. The panel on the left includes a sample document, while the panel on the right includes an annotation prompt and a series of labels in the ontology that the human analyst may select to specify the label(s) under which the document belongs.

Referring back to FIG. 10A now, at block 1030, the natural language platform may be configured to receive the various annotations entered by the human analysts, such as those inputs provided at the example interface in FIG. 10B. These annotations may provide a form of feedback to rate the quality of the ontology structure.

At block 1035, the natural language platform may be configured to evaluate the performance or effectiveness of the ontology based on the received annotations. Evaluation metrics may include the label distribution (how frequently the annotator selects each label) and the proportion of annotated documents that are not assigned any label.
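The two evaluation metrics named above may be computed, for example, as in the following sketch. The input representation (one list of selected labels per annotated document) and the function name are assumptions made for illustration.

```python
from collections import Counter


def evaluate_annotations(annotations):
    """Return (label distribution, proportion of documents with no label).

    annotations: one list of selected labels per annotated document
                 (an empty list means no label was assigned).
    """
    counts = Counter(label for labels in annotations for label in labels)
    total = sum(counts.values())
    distribution = {label: n / total for label, n in counts.items()} if total else {}
    unlabeled = sum(1 for labels in annotations if not labels)
    proportion_unlabeled = unlabeled / len(annotations) if annotations else 0.0
    return distribution, proportion_unlabeled
```

A heavily skewed distribution or a large unlabeled proportion may suggest that the labels in the ontology do not cover the documents well.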

Referring to FIG. 10C, illustration 1037 shows a sample annotation interface to collect input in the form of document annotations/classifications from human experts, according to some embodiments. These annotations can be used to evaluate the effectiveness of the ontology and to train the natural language processing models.

Referring back to FIG. 10A now, in some embodiments, the ontology generation process within block 510 may involve refining the ontology through a series of iterations. For example, it may be determined at block 1035 that the ontology in its current form does not satisfy certain performance criteria. Again, these evaluations may be based on the feedback supplied by the annotations. To improve the ontology content and structure, at step 1040, it may be determined that new rules should be generated or that existing rules may need to be modified. Thus, the ontology creation process may cycle back to block 1015. From here, the natural language platform may facilitate generation of new rules or modification of existing rules through the graphical user interface, so as to receive inputs from analysts to modify the rules. The rules generation process may again verify the veracity of the totality of rules to make sure that there is logical consistency. The process may then proceed to block 1020, then to block 1025, and so on to generate a refined ontology in a second iteration.

In other cases, it may be determined that modifying the rules may not be necessary as the accuracy is already at a level acceptable to the system user, but that the ontology may still be improved based on the annotations provided previously. Thus, at step 1045, the natural language platform may cycle back to block 1020 and may refine the ontology by incorporating the feedback from the annotations. Similar to step 1040, the ontology creation process may then proceed to block 1025, then to block 1030, and so on in a second or later iteration.

The iterative process for continually refining the ontology between step 1040 and step 1045 may continue a number of times until the evaluation of the ontology at block 1035 satisfies one or more performance criteria like inter-annotator agreement, using a metric derived from Krippendorff's alpha. If it is determined that the performance criteria are satisfied at step 1050, then at block 1055, the ontology may be considered complete enough and may be used in the next step of the natural language model generation process, which is the adaptive machine learning process at block 515.
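For illustration, the standard Krippendorff's alpha for nominal labels can be computed from a coincidence matrix as sketched below. The disclosure refers only to a metric derived from this statistic, so the exact metric used may differ; the threshold value, the convention for the degenerate single-label case, and the function names are assumptions.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: one list per document, holding the labels assigned by its annotators.
    Documents with fewer than two annotations carry no agreement information.
    """
    coincidences = Counter()
    for values in units:
        m = len(values)
        if m < 2:
            continue
        # Each ordered pair of annotations within a unit contributes 1 / (m - 1).
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one unit with two or more annotations")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        # Only one label was ever used; alpha is undefined, and this sketch
        # returns 1.0 by convention (an assumption).
        return 1.0
    return 1.0 - observed / expected


def agreement_satisfied(units, threshold=0.8):
    """Hypothetical performance criterion: alpha at or above a chosen threshold."""
    return krippendorff_alpha_nominal(units) >= threshold
```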

In some embodiments, at block 1060, during the evaluation phase of the ontology creation, analysts may evaluate the data and determine that certain clarifications may be needed to distinguish among various labels or tasks within the ontology. That is, part of the ontology refinement process may include trying to disambiguate two or more topics from each other, because the topics may be very closely related or potentially confusing. For example, when trying to classify news articles about a financial crisis, the label of “legal” articles may be potentially confusing or overlapping in some ways with the label of “security fraud.” While there may be a number of ways to distinguish between these two labels, the analysts may provide and the natural language platform may be configured to receive manual inputs to specify particularly how it is desired for these two topics to be distinguished. In order to do this, analysts may help generate training guidelines that may be used for instructing annotators in the adaptive machine learning process phase. These training guidelines may be expressed as human readable instructions that help define what the different labels mean, and/or how to choose among two or more labels when it is possible for a document to be classified into the two or more labels. For example, to distinguish between “legal” and “security fraud,” a training guideline may instruct the annotator to always classify the document as “security fraud” if it appears the document may fall under both categories. In other cases, the training guideline may attempt to provide definitions to help determine what constitutes “security fraud” compared to “legal” generally. In some embodiments, these training guidelines may be displayed to the annotators in one or more of the graphical user interfaces. An example of these training guidelines is described further in application (Attorney Docket No. 1402805.00013_IDB013), which again is incorporated herein by reference.

Having now described an example of a more detailed process for creating the ontology (see block 510 of FIG. 5), FIGS. 11A-13 will focus on more detailed examples of generating the natural language model through the adaptive machine learning process (see block 515 of FIG. 5), according to some embodiments.

Referring to FIG. 11A, illustration 1100 shows an example display for annotating as part of the adaptive machine learning process, according to some embodiments. As previously mentioned, the adaptive machine learning process according to aspects of the present disclosure includes an iterative interplay between annotations of documents and machine learning to generate the natural language model based on the annotations. Illustration 1100 shows an example of one type of annotation applied to a document. In some embodiments and as previously alluded to, the display interface for allowing a human annotator to supply an annotation is called a work unit. In some embodiments, each work unit represents a single task for supplying a single annotation to one document (or span). In some embodiments, the natural language platform is configured to efficiently generate work units such that human annotators' time is efficiently utilized.

As shown in illustration 1100, examples of a work unit may include the text from a document or at least a subset of text. As shown, for example, the document on the left side of the work unit is a tweet from Top World News. The text states, among other things, “J.P. Morgan faces scrutiny over Asia hiring practices: J.P. Morgan Chase's hiring of the son of a Chinese . . . ” This document also includes a photo and a title and introductory words to a news article that is referenced in the tweet. On the right side of the work unit is an annotation prompt that includes a subject header and a human readable question. The annotation prompt, including the human readable question, may be generated automatically by the natural language platform. The natural language platform may determine the form of the question so as to most efficiently retrieve the desirable annotation from the annotator. In this case, the natural language platform has generated a human readable question that prompts the annotator to select as many topics (labels) as the annotator believes best apply to the document on the left. This type of question is a rather open ended question, and in other cases, the natural language platform may be configured to generate more pointed questions when it has determined that it needs more specific information. In this case, the annotator has selected two labels that he or she believes are most applicable: “legal & compliance,” and “corporate news.” Once submitted, this particular document would then have this particular annotation applied to it. Additional annotations may be applied to this document based on inputs from other annotators, when the other annotators see this particular work unit or other work units that include this particular document.

Referring to FIG. 11B, illustration 1150 represents another example of a work unit used to annotate a document, according to some embodiments. Like before, the text of the document to be annotated is shown on the left. On the right side of the work unit is the annotation prompt, which includes a header and a human readable question for the annotator to answer. In this case, the question of “What is the best label for this document?” allows for the annotator to select only one of three options: “highly relevant,” “semi-relevant,” and “totally irrelevant.” In some embodiments, the annotator may have been provided training guidelines to help define for the annotator what it means for something to be relevant, semi-relevant, or irrelevant. For example, a training guideline may have instructed the annotator to determine the relevance of the document to the topic of “corporate finance.” In addition, the training guideline may have provided definitions to distinguish between the document being highly relevant versus merely semi-relevant. In some embodiments, a button or icon for the training guideline may be placed in the display of the work unit that displays the training guideline when the button or icon is pressed. In other cases, a previous display screen may have provided instructions based on the training guidelines to the annotator to prepare the annotator for how to respond to the annotation prompts.

Still referring to illustration 1150, the type of annotation prompt shown here has a narrower scope than the one in illustration 1100. Here, the annotator is restricted to selecting only one of the three options. In contrast, the annotator in illustration 1100 is allowed to select as many options as deemed relevant. In some embodiments, these differences in questions and their intended scopes are intentionally and automatically produced by the natural language platform and may be based on what type of information the natural language platform desires to gain from the annotators.

For example, if the natural language platform is in the early training phases for generating the natural language model, general information about the various types of documents may be desired. Therefore, the types of annotation prompts may be more open ended, and may allow for annotators to answer broad questions, such as questions that may have multiple answers (e.g., see illustration 1100).

As another example, if the natural language platform is in more mature training phases, more specific information to refine the determination between labels may be desired. Therefore, the natural language platform may desire specific information to determine differences between two specific labels. The natural language platform may be configured to generate an annotation prompt that directs the annotator to distinguish the two specific labels using a pertinent document. In this case, an example annotation prompt may include the question, “Is the best label for this document label ‘A’ or label ‘B’?” Then, the natural language platform allows the annotator to select only label A or label B.

In general, the natural language platform may be configured to generate annotation prompts with varying degrees of specificity, depending on the type of information desired to better refine the natural language model. Other examples can include true/false questions, yes/no questions, and other types of questions that direct an annotator to varying levels of specificity apparent to those with skill in the art, and embodiments are not so limited. These varying degrees of specificity also help to efficiently allocate a human's time. As there are thousands, millions, or even billions of documents that could be annotated, a human's time can be valuable and scarce, such that the techniques presented herein should determine what documents are most appropriate to be sent to human annotators, and what types of information about the documents should be obtained.

With various examples of annotations now described, referring to FIG. 12, flowchart 1200 provides a more detailed process flow for conducting the adaptive machine learning process using the annotations to generate the natural language model, according to some embodiments. The example details herein may represent further details in block 515 of FIG. 5.

This sub process may start with block 1205, involving the natural language platform ingesting the data and inputting or applying an ontology for use in the adaptive machine learning process. Examples of these processes are described in previous figures, such as FIG. 6 and FIG. 10. Having accessed the proper data, at block 1210, the natural language platform may begin the adaptive machine learning process to generate a natural language model by selecting particular documents to be annotated. The process for selecting which documents are to be annotated may be referred to herein as “intelligent queuing.”

In general, annotations to documents may be useful in aiding a computer to find patterns in the text of the collection of documents in order to determine how the documents should be classified. The greater the number of documents that are annotated, the more reliably the computer should be able to create an accurate natural language model for the overall collection of documents. However, obtaining annotations for tens of thousands or even millions of documents in a collection may be impractical, and in some cases, a substantially smaller number of documents that are carefully selected may help produce a natural language model with substantially the same performance. For example, obtaining annotations for certain documents that are representative of the various topics or labels in the ontology, as opposed to exhaustively obtaining annotations for all documents in a single topic, may be a more efficient use of a human annotator's time. As another example, it may be more useful to obtain an annotation for a document that is more complex or possesses content that may arguably be categorized into two or more labels, as opposed to a document that is unambiguously within just a single label, because the annotation may help the machine learn what to do in potentially ambiguous cases consistent with this document. The process of performing intelligent queuing includes these processes of carefully selecting which documents among an entire collection of documents should be slated to receive annotations, with one objective being to efficiently utilize the human annotator's time.

When performing intelligent queuing in a first iteration, i.e., conducting block 1210 for the first time, the intelligent queuing process may include obtaining a seed set of documents that is representative of the various tasks and labels as outlined in the ontology. Other factors may be included in the seed set of documents, and further details of intelligent queuing are described in application (Attorney Docket No. 1402805.00012_IDB012), which again is incorporated herein by reference in its entirety.

After selecting which documents are to be annotated, the process of intelligent queuing at block 1210 may also include generating the annotation questions to be presented to an annotator (or multiple annotators) in order to elicit an annotation response. Examples of various types of annotation questions include the examples in FIGS. 11A and 11B. As previously mentioned, the wording of the question, including the level of specificity desired to be elicited from the annotator, may be automatically generated by the natural language platform, based on the type of information desired when performing intelligent queuing. Thus, block 1210 may also include causing display of the annotation prompt in a work unit that is accessible to an annotator. This may be displayed in one of the application modules 225, for example. Other example displays are provided in application (Attorney Docket No. 1402805.00013_IDB013), which again is incorporated herein by reference.

In some embodiments, providing annotation questions in the intelligent queuing process may also include presenting the annotation questions of the same document to multiple annotators. In this way, a single annotator may not be relied on to provide the annotation to a single document. For example, a single annotator may make mistakes, or in worst cases a single annotator may intentionally provide wrong answers. For example, obtaining annotations from annotators may involve a crowdsourcing process, and the reliability of any single annotator may not be known.

At block 1215, the natural language platform may be configured to receive annotations in response to the annotation questions generated in block 1210. In between blocks 1210 and 1215, one or more annotators may interact with a graphical user interface to analyze one or more documents and provide annotations according to the annotation prompts supplied by the natural language platform. For example, an annotator may answer a question similar to one of the questions in FIG. 11A or 11B. These responses may be stored in a database repository, such as in the database modules 210.

At block 1220, in the event that multiple annotators supply annotations to the same document, in some embodiments, the natural language platform may be configured to conduct an annotation aggregation process that analyzes all of the annotations made to a single document to determine whether said document may be reliably annotated for use in training the natural language model. For example, if six annotators provided annotations to the same document, and three annotators classified the document as belonging to a first label, while the other three annotators classified the document as belonging to a second label, then the annotation aggregation process of block 1220 may determine that this document is too ambiguous to be used in the adaptive machine learning training process. As another example, if five of the six annotators provided a consistent annotation to the document, while a sixth annotator provided a different annotation, it may be determined that this document has sufficiently consistent annotations so as to be used in the adaptive machine learning process.

In general, in some embodiments, the annotation aggregation process of block 1220 may include computing an annotation aggregation score based on the total number of annotations and reflecting a level of agreement between the annotations. Based on this computed score, in some embodiments, the annotation aggregation process may determine whether the document may be used to train the natural language model, e.g., if the score satisfies a first threshold criterion. In addition, in some embodiments, based on the score, it may be determined that additional annotations from other annotators should be elicited before the document may be used in the model training process, e.g., if the score satisfies a second threshold criterion. In addition, in some embodiments, based on the score, it may be determined that the document should not be used for the model training process, e.g., if the score satisfies a third threshold criterion. This process may be applied for each document to determine if said document is valid for training the natural language model. Further details about the annotation aggregation process are described in application (Attorney Docket No. 1402805.00019_IDB019), which again is incorporated herein by reference in its entirety.
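As a non-limiting illustration, a three-way decision of this kind may be sketched as follows. The score used here (the fraction of annotations agreeing with the most common label), the function name aggregate_annotations, and the threshold values are assumptions made only for this sketch; the scoring actually used is described in the incorporated application and is not reproduced here.

```python
from collections import Counter

def aggregate_annotations(labels, use_threshold=0.8, discard_threshold=0.5,
                          min_annotations=3):
    """Return (decision, top_label, score) for one document's annotations.

    All thresholds are hypothetical values chosen for illustration.
    """
    if not labels:
        return ("need_more", None, 0.0)
    top_label, top_count = Counter(labels).most_common(1)[0]
    # Agreement score: share of annotators who chose the most common label.
    score = top_count / len(labels)
    if len(labels) < min_annotations:
        return ("need_more", top_label, score)   # too few annotations to judge
    if score >= use_threshold:
        return ("use", top_label, score)         # consistent enough to train on
    if score <= discard_threshold:
        return ("discard", top_label, score)     # too ambiguous to train on
    return ("need_more", top_label, score)       # elicit further annotations

# Five of six annotators agree, so the document is kept for training.
decision, label, score = aggregate_annotations(["legal"] * 5 + ["corporate news"])
print(decision, label, round(score, 2))
```

Under these assumed thresholds, the five-of-six example above is retained and the three-versus-three example is discarded, while a four-versus-two split would be routed back for additional annotations.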

At block 1225, the natural language platform may be configured to generate the natural language model using the aggregated annotated data (e.g., documents with a sufficiently consistent aggregation score) in an adaptive machine learning training process. In general, the adaptive machine learning training process may find patterns in the training data (e.g., aggregated annotated data) that may be used to make predictions about untested data. The process for adaptive machine learning according to some embodiments includes: (a) eliciting annotations from humans; (b) building an initial version of the model; (c) evaluating the effectiveness of the model and determining if stopping criteria have been met; (d) identifying specific areas of improvement and the annotations required for improvement in those areas; (e) queuing additional documents for human annotations, and serving those documents to the optimal annotator with the optimal interface for the combination of annotator and task; (f) updating the model using the new annotations; and (g) repeating the whole process until the stopping criteria are met. From this, a first version of the natural language model may be generated, and, if desired, new versions of the model may be continually generated as well.

In general, the natural language platform according to aspects of the present disclosure first seeks to obtain annotations for particular documents before conducting any machine learning techniques. Receiving annotations before conducting any machine learning techniques may allow for the adaptive machine learning process to be more efficient, in that the adaptive machine learning begins building the natural language model from documents whose “true values” are already known, e.g., the documents already have reliable annotations.

At block 1230, the natural language platform may be configured to test the performance of the newly created natural language model, according to some embodiments. Testing the performance of the natural language model may allow the natural language platform (and/or a user) to determine later on whether the natural language model should be refined and improved before being finalized.

In some embodiments, testing the performance of the natural language model at block 1230 includes performing a cross validation process. For example, the cross validation process may involve creating a first “test” natural language model, using the same adaptive machine learning training process used to generate the original natural language model, but based on only a subset of the training data, e.g., 90% of the documents in the training data (e.g., supposing there are 1000 documents in the training data, then documents 1-900). This first “test” natural language model is tested by applying it to the remaining 10% of the documents, which were not used to create the first “test” natural language model but which already have annotations. In this way, the veracity of the first test natural language model may be compared against “true answers” provided by the original annotations of the remaining 10% of the documents. The cross validation process may also involve creating a second “test” natural language model, using the same adaptive machine learning process used to generate the original natural language model, but based on a different subset of the training data, e.g., a different 90% of the training data (e.g., documents 1-800 and 901-1000). This second test natural language model may be tested in the same way as the first test natural language model, and therefore the veracity of the second test natural language model may be assessed against the second 10% subset of the training data (e.g., documents 801-900). This cross validation process may continue, each time with a different 90% of the training data (for a total of 10 different iterations), so that the veracity of the machine learning process used to create the original natural language model (and each of the test natural language models) may be verified against different subsets of the training data, and ultimately against all of the training data divided into separate parts, where the “true answers” are known. Because all of the training data is being tested against, and because the test natural language models are generated using the same machine learning process as the one used to generate the original natural language model in question, the aggregate performance of the test natural language models may be used as a reliable approximation of the performance of the original natural language model. In general, subdividing the training data into 10% increments is merely one example, and other partitions may be used.
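As a non-limiting illustration, the ten-way partitioning described above may be sketched as follows. The names k_fold_indices, cross_validate, and train_majority are hypothetical, and the trivial majority-label trainer merely stands in for the adaptive machine learning training process, which is not reproduced here. The sketch holds out the first tenth of the documents first rather than the last tenth; the order in which the parts are held out does not affect the aggregate result.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train_idx, test_idx) pairs so each item is held out exactly once."""
    if n_items < k:
        raise ValueError("need at least k items")
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_items))
        yield train_idx, test_idx
        start += size

def cross_validate(docs, labels, train_fn, k=10):
    """Train one "test" model per fold; return (mean accuracy, per-fold accuracies)."""
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(docs), k):
        model = train_fn([docs[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        # Compare predictions with the known annotations of the held-out part.
        correct = sum(model(docs[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies), accuracies

def train_majority(train_docs, train_labels):
    """Toy stand-in trainer: always predicts the most common training label."""
    most_common = max(set(train_labels), key=train_labels.count)
    return lambda doc: most_common

docs = ["document %d" % i for i in range(1000)]
labels = ["legal"] * 700 + ["other"] * 300
mean_accuracy, per_fold = cross_validate(docs, labels, train_majority)
print(mean_accuracy, per_fold)
```

Any callable that accepts training documents and labels and returns a prediction function may be substituted for the toy trainer, so that every test model is built by the same process as the original model.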

At block 1235, after having tested the performance of the natural language model, either through the example cross validation process just described or through other validation processes apparent to those with skill in the art, the natural language platform may be configured to generate a performance metric of the natural language model, based on the tests run against it. For example, if the performance of the natural language model was determined using the aforementioned cross validation techniques, one or more metrics quantifying a degree of accuracy may be supplied. In other cases, one accuracy measurement for each of the “test” models in the cross validation process may be supplied, and the natural language platform may be configured to provide an aggregate score at block 1235. In some embodiments, these performance metrics may be displayed, such as in application modules 225. Other examples of these performance metrics are described in application (Attorney Docket No. 1402805.00012_IDB012), which is again incorporated herein by reference. Example performance metrics include: model precision, recall, and f1-measure, derived through cross-validation; area under the receiver operating characteristic curve (AUC); rate of improvement in precision, recall, and/or f1-measure as a function of the number of annotations; and percentage of the feature vocabulary in the total collection that is represented in the annotated data set.
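As a non-limiting illustration, precision, recall, and f1-measure for a single label may be computed from held-out predictions using their standard definitions, as sketched below. The function name precision_recall_f1 is hypothetical, and the convention of returning zero when a denominator is zero is an assumption of this sketch.

```python
def precision_recall_f1(true_labels, predicted_labels, positive_label):
    """Compute precision, recall, and f1-measure for one label of interest."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive_label and p == positive_label for t, p in pairs)
    fp = sum(t != positive_label and p == positive_label for t, p in pairs)
    fn = sum(t == positive_label and p != positive_label for t, p in pairs)
    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

truth = ["legal", "legal", "legal", "other", "other"]
predicted = ["legal", "legal", "other", "legal", "other"]
print(precision_recall_f1(truth, predicted, "legal"))
```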

From here, the natural language platform may determine if the performance of the natural language model satisfies one or more performance criteria. The one or more performance criteria may include a pre-determined threshold, or in other cases may be specified by a user. In the event that the natural language model satisfies the performance criteria, the process may proceed to block 1245. Otherwise, the process may proceed to block 1240, where the natural language model is to be refined through one or more optimization techniques according to some embodiments.

At block 1240, supposing that the natural language model fails to satisfy the one or more performance criteria, the natural language platform may be configured to perform various optimization techniques to improve the performance of the natural language model, according to some embodiments. Numerous examples of optimization techniques are described in FIG. 13 below. Some or all of the techniques may be performed, and embodiments are not so limited.

In some embodiments, one of the optimization techniques includes conducting another iteration of annotations on the same or different documents among the collection of documents. In other words, the process in flowchart 1200 may cycle back to block 1210, consistent with the iterative adaptive machine learning process described above. For example, through the tested performance of the natural language model at block 1230, it may have been determined that the natural language model accurately predicts documents for certain labels, while it struggles to predict documents into other labels in the ontology. The natural language platform may therefore be configured to conduct a second iteration of intelligent queuing that aims to obtain annotations from documents to help resolve labels for which the original natural language model showed unsatisfactory performance. Further example details of how the tested performance of the previous iteration of the natural language model may be used to guide a subsequent instance of intelligent queuing are described in application (Attorney Docket No. 1402805.00012_IDB012), which is again incorporated herein by reference.

To complete one or more iterations of refinement of the natural language model, the adaptive machine learning process may continue through blocks 1215 to 1235, consistent with the descriptions above but with new annotations and possibly other optimization techniques performed to generate the revised natural language model. In other cases, after one or more optimization techniques are performed at block 1240, rather than repeating blocks 1215 through 1220, the model generation process may proceed straight to block 1225 to generate an updated natural language model using the updated parameters from the one or more optimization techniques and the existing aggregated annotations. Ultimately, it may be determined that the natural language model satisfies the one or more performance criteria, at which point the process may proceed to block 1245. Here, the finalized natural language model may be transmitted to make predictions on untested data. The finalized natural language model may be transmitted to a system exclusively controlled by a client, such as in some of the examples shown in FIG. 2A, or in other cases may be transmitted to specific devices, such as devices 120 or 130.

In some embodiments, the transmitted natural language model may be stored in a format that is universally portable. For example, the natural language model may be stored in memory efficiently such that the same model may be executable in mobile devices as well as conventional computers. Additional details of the universal portability aspect according to some embodiments are described further in application (Attorney Docket No. 1402805.00018_IDB018), which is incorporated herein by reference in its entirety.

At block 1250, the natural language model may be used to analyze untested data. The untested data may be based on data obtained from the client not used as part of the training data, while in other cases the untested data may include live or real-time data, such as live streams of tweets or customer service emails. In general, the natural language model may analyze untested data by classifying each unit of the untested data into one or more of the tasks or labels as organized in the ontology. The natural language model may be used on millions or even billions of units of data.

Referring to FIG. 13, illustration 1300 provides further details of different examples of optimization techniques that may be applied in block 1240 of FIG. 12 to improve the natural language model, according to some embodiments. The following are some descriptions of each of the different examples listed, although some embodiments may use only some of these examples, and embodiments are not so limited.

At block 1305, one example for performing optimization techniques to improve the natural language model may include using the results from cross validation or other performance evaluation techniques to guide the selection of further documents to be annotated when performing a next iteration of intelligent queuing. As previously mentioned, additional annotations to documents, either to some of the same documents or even new documents, may be sought to help resolve some ambiguities in the natural language model or to help improve areas that the natural language model has shown difficulty in classifying. To determine which documents should be used in this next round of annotations, the intelligent queuing process according to aspects of the present disclosure may base these decisions on the performance results generated by performing the cross validation or other performance evaluation techniques from block 1230 of FIG. 12. Further example details for how the intelligent queuing process utilizes the performance evaluation techniques from block 1230 are described in application (Attorney Docket No. 1402805.00012_IDB012), which is again incorporated herein by reference. Block 1305 may also include a process for determining what type of annotation questions should be presented to an annotator to achieve the proper level of detail for improving the natural language model.

At block 1310, another example technique includes performing the intelligent queuing process itself, which is to say that the process proceeds back to block 1210, which has already been discussed.

At block 1315, another example optimization technique includes performing feature selection. Feature selection may include applying one or more transformations to each annotated document to generate a set of machine learning features for that document, applying one or more evaluation metrics to each feature to estimate the relative importance of each feature to the natural language model, and finally selecting an appropriate subset of features for inclusion in the natural language model. For example, every token in each document may be designated as a feature, and each feature may be ordered based on the frequency with which it appears in the total set. Other examples may use statistical co-occurrence criteria (such as chi square and mutual information) or information theoretic criteria (such as information gain) to perform feature evaluation. The feature transformation process is described in more detail in application (Attorney Docket No. 1402805.00017_IDB017), again incorporated herein by reference.
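As a non-limiting illustration, the frequency-based example may be sketched as follows. Whitespace tokenization, lowercasing, alphabetical tie-breaking, and the function name select_features_by_frequency are assumptions made only for this sketch.

```python
from collections import Counter

def select_features_by_frequency(documents, top_k):
    """Treat every token as a feature and keep the top_k most frequent ones."""
    counts = Counter(token for doc in documents for token in doc.lower().split())
    # Rank by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [token for token, _ in ranked[:top_k]]

documents = ["fraud case filed", "fraud probe", "bank fraud case"]
print(select_features_by_frequency(documents, 2))
```

A chi square, mutual information, or information gain criterion would replace only the ranking key in such a sketch; the transformation and selection steps would otherwise have the same shape.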

At block 1320, another example includes performing padding and rebalancing of the data in a natural language model. This may include duplicating or deleting specific annotations to create an artificial mix of training items to improve model performance in specific areas. Padding and rebalancing may include identifying individual items in the data that are representative of rare but significant labels in the ontology and artificially duplicating these items so that they make up a larger proportion of the training data used to develop the natural language models. This technique can also be referred to as “upsampling.” For example, if positive items are rare relative to negative items in a set of training data used to develop a sentiment classification model, the positive items identified during annotation can be copied one or more times and included in the training data for the model. The number of times that the positive items are copied can be determined through a number of optimization techniques in machine learning, including cross validation. Cross validation can identify the number of positive and negative items to include in training that will achieve the best model performance (as measured by accuracy, precision, recall or f-score) by cyclically adjusting the numbers of items in each category, retraining the model, and testing the performance of the resulting model. Alternatively, individual items in the data that cause the natural language model to make erroneous classifications may be removed from the training data to improve model performance. This technique can also be referred to as “downsampling.”
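As a non-limiting illustration, the duplication step of upsampling may be sketched as follows. The function name upsample and the fixed copy count are assumptions of this sketch; as noted above, the number of copies may instead be chosen by cross validation.

```python
def upsample(items, labels, rare_label, copies):
    """Return new lists in which each rare_label item appears `copies` extra times."""
    out_items, out_labels = list(items), list(labels)
    for item, label in zip(items, labels):
        if label == rare_label:
            out_items.extend([item] * copies)
            out_labels.extend([label] * copies)
    return out_items, out_labels

items = ["bad service", "slow reply", "never again", "love it"]
labels = ["negative", "negative", "negative", "positive"]
new_items, new_labels = upsample(items, labels, "positive", copies=2)
# The single positive item now appears three times among six training items.
print(new_labels.count("positive"), len(new_items))
```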

At block 1325, another example includes performing pruning techniques. Pruning techniques may include processes to reduce the size and complexity of certain types of statistical models, such as decision trees, that are used to develop natural language models. Pruning has the benefit of reducing the tendency of a model to represent noise in the data (overfitting) by removing parts of the model that are useful in classifying only a small number of items in the training data. For example, a natural language model that is trained to distinguish five different topical categories of news articles using decision trees and individual word features may perform well on training data, but poorly on unseen data. In order to improve the generalizability of the model and to make performance more consistent, certain nodes or features used by the model that only help in classifying less than 1% of the data may be trimmed away or removed to simplify the model and improve performance across different data sets.
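As a non-limiting illustration of the 1% example, the sketch below removes word features that occur in fewer than a given fraction of the training documents. This document-frequency cutoff is a simplified stand-in for pruning low-coverage features and is not a decision tree pruning algorithm; the function name prune_rare_features and the use of document frequency as the measure of how much data a feature helps classify are assumptions of this sketch.

```python
from collections import Counter

def prune_rare_features(doc_features, min_fraction=0.01):
    """Drop features present in fewer than min_fraction of the documents.

    doc_features is a list of per-document feature lists.
    Returns (pruned per-document feature lists, set of retained features).
    """
    n_docs = len(doc_features)
    # Count each feature at most once per document.
    doc_freq = Counter(f for feats in doc_features for f in set(feats))
    keep = {f for f, count in doc_freq.items() if count / n_docs >= min_fraction}
    pruned = [[f for f in feats if f in keep] for feats in doc_features]
    return pruned, keep

docs = [["news"] for _ in range(199)] + [["news", "zyzzyva"]]
pruned, kept = prune_rare_features(docs)
print(sorted(kept))
```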

At block 1330, another example includes performing ensemble or voting methods. This may include combining multiple pre-trained natural language processing models to generate multiple predictions for each unlabeled item in the data set, identifying the items in the data set that meet specific prediction matching criteria (e.g., unanimous, majority, at least n), and using these items to train a new natural language processing model. This approach may sometimes be referred to as semi-supervised machine learning. In some embodiments, the pre-trained models have been trained using a subset of training items from the target data set. In other embodiments, the pre-trained models have been trained using training items from external data sets. A variety of matching criteria can be applied to the resulting predictions to identify items that meet the minimum requirements for being included as training items in a new natural language model. In some embodiments, a majority vote of the models will determine the label that is assigned to an item in the data set. For example, in an ensemble of six different sentiment models, if four models predict that an unlabeled tweet is positive, and two models predict that that same tweet is negative, the item will be assigned a positive label and subsequently used for training a new sentiment model. On the other hand, if three of the models predict that an unlabeled tweet is positive, and three models predict that it is negative, the item will not be assigned a label and will not be used for training a new sentiment model. In other embodiments, an unlabeled item will receive a label only if that label corresponds to at least n matching predictions from the ensemble of classifiers. In other embodiments, an unlabeled item will receive a label only if that label corresponds to a unanimous set of predictions from the ensemble of classifiers.
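As a non-limiting illustration, the three matching criteria may be sketched as follows. The function name vote_label, the criterion strings, and the choice to assign no label whenever two labels tie for the most votes are assumptions of this sketch.

```python
from collections import Counter

def vote_label(predictions, criterion="majority", n=None):
    """Return the agreed label for one item, or None if the criterion is not met."""
    if not predictions:
        return None
    top = Counter(predictions).most_common(2)
    label, votes = top[0]
    # Assumed rule: a tie for first place never yields a label.
    if len(top) > 1 and top[1][1] == votes:
        return None
    if criterion == "unanimous":
        met = votes == len(predictions)
    elif criterion == "majority":
        met = votes > len(predictions) / 2
    elif criterion == "at_least_n":
        met = n is not None and votes >= n
    else:
        raise ValueError("unknown criterion: %r" % criterion)
    return label if met else None

four_two = ["positive"] * 4 + ["negative"] * 2
three_three = ["positive"] * 3 + ["negative"] * 3
print(vote_label(four_two))     # positive
print(vote_label(three_three))  # None
```

Items for which a label is returned would then be added to the training items for the new model, while items returning None would be left out.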

At block 1335, another example includes performing feature discovery. This may include identifying the set of feature transformations and feature selection criteria that provide the highest model performance for the provided data. In some embodiments, the set of selected feature transformations and feature selection criteria can be represented as a string of independent binary values. For example, each supported feature transformation or selection criterion may be assigned a position within the string, where a 1 value corresponds to the transformation or criterion being enabled and a 0 corresponds to it being disabled. In such embodiments, a genetic search process may then be used to iteratively define numerous configurations, training and evaluating each in sequence according to the training and testing processes 1225, 1230 and 1235 in FIG. 12, refining and improving the configuration each iteration. In such embodiments, the final configuration of feature transformations and selection criteria may be saved to a database so that later applications of the feature selection process 1315 use the optimized configuration.
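As a non-limiting illustration, a genetic search over such a binary string may be sketched as follows. The function name genetic_search, the population size, the mutation rate, single-point crossover, and the retention of the better half of each generation are assumptions of this sketch. The fitness callable stands in for training and evaluating a model with a given configuration (blocks 1225, 1230 and 1235); a toy fitness is used here so that the sketch is self-contained.

```python
import random

def genetic_search(n_bits, fitness, generations=20, population_size=12,
                   mutation_rate=0.1, seed=0):
    """Search bit strings of length n_bits (n_bits >= 2) for high fitness."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(population_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # Keep the better half unchanged and breed the remainder from it.
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: population_size // 2]
        children = []
        while len(children) < population_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)            # single-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]                # random bit flips
            children.append(child)
        population = parents + children
        best = max(population + [best], key=fitness)
    return best

# Toy fitness: number of positions matching a hypothetical ideal configuration.
ideal = [1, 0, 1, 1, 0, 0, 1, 0]
toy_fitness = lambda bits: sum(b == t for b, t in zip(bits, ideal))
print(genetic_search(len(ideal), toy_fitness))
```

In practice each fitness evaluation would involve a full train-and-test cycle, so the population size and number of generations bound the number of models that must be trained.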

At block 1340, another example includes performing model interpolation. This may include combining model parameters from at least one previous model with a provided set of model parameters for the desired model to create an interpolated model which has greater performance than would be achieved from the provided set of model parameters alone. In some embodiments, the parameters incorporated from the at least one previous model may include the annotations and documents used to train each previous model, and the provided parameters for the desired model may include a set of label names, documents, annotations, and rules. In some embodiments, the interpolated model is generated by selecting the parameters from the at least one previous model which do not conflict with the provided parameters, and proceeding with the model training and optimization processes using the combined data set. For example, a previous model may include a large set of English social media posts from Twitter® discussing a variety of topics of general interest (movie releases, consumer electronics, recent personal experiences, and current events), and annotations imputed to each document according to the sentiment (“positive” or “negative”) conveyed by that document. Conventionally, such models perform with poor accuracy when applied to process text that uses highly specific and unique terminology; for example, social media posts about the finance industry. According to some embodiments, an interpolated model may be generated by providing a small number of parameters (for example, a rule defining “overweight” as conveying “positive” sentiment), combined with the documents and annotations from the previous model parameters that do not conflict with this rule. In this way, highly accurate domain-specific models may be interpolated from a large amount of previous parameters and a comparatively small amount of provided parameters.
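One possible reading of the conflict filtering described for block 1340 is sketched below in Python. The function name `interpolate_training_set`, the data shapes, and in particular the definition of a conflict (a previous document contains a rule term but carries a different label than the rule assigns) are assumptions made for illustration; the disclosure does not define how conflicts are detected.

```python
def interpolate_training_set(previous, provided, rules):
    """Combine previous (text, label) pairs with provided ones, dropping
    previous pairs that conflict with a provided rule.

    rules maps a lowercase term to the label the rule assigns to it.
    """
    def conflicts(text, label):
        # Assumption: simple whitespace tokenization is good enough here.
        words = set(text.lower().split())
        return any(term in words and rule_label != label
                   for term, rule_label in rules.items())

    kept = [(text, label) for text, label in previous
            if not conflicts(text, label)]
    return kept + list(provided)


if __name__ == "__main__":
    previous = [
        ("great movie tonight", "positive"),
        ("feeling overweight after the holidays", "negative"),
    ]
    provided = [("analysts are overweight on bank stocks", "positive")]
    rules = {"overweight": "positive"}
    for pair in interpolate_training_set(previous, provided, rules):
        print(pair)
```

Under this reading, the general-interest post that uses "overweight" negatively is excluded, and the combined list is what would be passed on to the model training and optimization processes.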

At block 1345, another example includes developing new rules for use in defining the ontology or additional annotations. How the new rules may be implemented is described further with reference to FIG. 14, below.

At block 1350, another example includes tuning model training hyper-parameters. Examples of model hyper-parameters include regularization constraints applied to the entire model training process (such as a constraint on the L₁ or L₂ norms), e.g., at block 1355, or adding a smoothing term to each feature individually during the feature selection process (such as a Laplacian smoothing term), e.g., at block 1360. Conventional machine learning techniques produce models which attempt to closely match their training data. For example, in a set of documents about “Bill Clinton” that each have associated annotations imputing “positive” or “negative” sentiment, a small subset of such documents may be found to also discuss “Kenneth Starr,” with an overwhelming percentage of this subset imputed as conveying “negative” sentiment. In such a case, a conventional model training process may learn to interpret “Kenneth Starr” as strongly conveying “negative” sentiment, although this learned interpretation is an artifact of limited training examples rather than an actual conveyance of “negative” sentiment; such a case is known to those with skill in the art as an example of “over-fitting.” In such cases, introducing hyper-parameter constraints can be used to penalize rare terms, such that the model training process will not treat such terms as strong conveyances of any label. In some embodiments, each hyper-parameter may assume an infinite number of possible values (for example, a scale factor or additive term that may be assigned any numerical value). In such embodiments, selecting appropriate hyper-parameter values involves training and evaluating a large number of models with different hyper-parameter configurations. In some embodiments, Monte Carlo techniques, e.g., block 1365, such as the Metropolis-Hastings algorithm may be used to iteratively select and refine hyper-parameter values until a suitably accurate model is created or a maximum number of iterations has been performed.
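The Monte Carlo selection of hyper-parameter values mentioned for block 1365 might look roughly like the following Python sketch. It uses a random walk with a symmetric Gaussian proposal, which is the Metropolis special case of the Metropolis-Hastings algorithm, over a single real-valued hyper-parameter. The function name `mh_search`, the `temperature` parameter, and the `score` callable (a hypothetical stand-in for training a model with the proposed value and returning its accuracy) are assumptions, not details from the disclosure.

```python
import math
import random


def mh_search(score, start, step=0.5, temperature=0.05,
              max_iters=200, target=None, seed=0):
    """Metropolis-style random walk over one hyper-parameter.

    Stops when the best score reaches `target` (if given) or after
    `max_iters` proposals. Returns (best_value, best_score).
    """
    rng = random.Random(seed)
    current, current_score = start, score(start)
    best, best_score = current, current_score
    for _ in range(max_iters):
        if target is not None and best_score >= target:
            break
        # Symmetric proposal: a Gaussian step around the current value.
        proposal = current + rng.gauss(0.0, step)
        proposal_score = score(proposal)
        delta = proposal_score - current_score
        # Always accept improvements; accept a worse proposal with
        # probability exp(delta / temperature).
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = proposal, proposal_score
        if current_score > best_score:
            best, best_score = current, current_score
    return best, best_score


if __name__ == "__main__":
    # Toy objective with its maximum at a regularization strength of 2.0.
    toy_accuracy = lambda reg: -(reg - 2.0) ** 2
    print(mh_search(toy_accuracy, start=0.0))
```

Occasionally accepting worse values lets the walk leave poor local optima, at the cost of extra model trainings; a real `score` function would be far more expensive than the toy objective used here.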

In general, the iterative process as described in flowchart 1200 may efficiently develop a natural language model by identifying and focusing on problem areas of the natural language model through multiple annotation and test phases. In this way, whereas conventional processes for developing a natural language model may take up to six months, achieving the same performance through aspects of the present disclosure may take only three weeks. In general, the processes described herein for producing a natural language model drastically reduce time to convergence and efficiently utilize time spent by humans to provide input for the machine learning components.

Referring to FIG. 14, flowchart 1400 illustrates a modified process for generating natural language models, including iterating processes between some processes within the create ontology block and the conduct adaptive machine learning block of FIG. 5, according to some embodiments. In previous descriptions, iterative processes were described to occur within the create ontology block 510 and within the conduct adaptive machine learning block 515. In some embodiments, a third relationship of iterations may also be allowed, such as processes within the adaptive machine learning block 515 iterating back to certain blocks within the create ontology block 510. For example, as shown in flowchart 1400, while blocks 1405, 1410, 1415, and 1425 may include descriptions that are consistent with a first iteration of processes described in previous figures, such as in FIGS. 5, 10, and 12, additional iterative loops may also be possible from the adaptive machine learning block 1425.

For example, after at least a first iteration of the natural language model is generated at block 1425, consistent with the descriptions in FIG. 12, it may be determined that the ontology could be modified to improve model performance. Thus, from block 1425, the natural language platform may be configured to proceed back to block 1420 to allow for the ontology to be modified. For example, the natural language model may access inputs provided by analysts to modify one or more of the labels or tasks in the ontology.

As another example of an iterative loop, it may be determined that rules may be modified to improve performance of the natural language model. Thus, from block 1425, the natural language platform may be configured to proceed back to block 1415, which in some cases is a sub-process within the overall process for creating the ontology. To modify some of the rules, the natural language platform may be configured to apply a weight to each of the rules, where the weight represents a degree to which the rule should be heeded in comparison to other rules. For example, a weight of 1.0 for a rule may signify that the rule should always be followed, whereas a weight of 0.5 may signify that the rule should be followed according to a probabilistic formula that may factor in additional statistical parameters, resulting in a composite statistical score to determine how to classify a document. Thus, modifying the rules may also include adjusting any weights applied to the rules. Modifying the rules may also involve creating new rules or modifying the logical parameters to which the rules should be applied. After any rules are modified at block 1415, the process may again proceed to block 1420, and continue down the process in another iteration.
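The rule weighting described above can be made concrete with a small Python sketch. The disclosure does not give the probabilistic formula, so the linear blend below is purely an assumption for illustration, as are the names `composite_score` and `classify` and the 0.5 decision threshold. The rule is assumed to vote for the "positive" label, and `model_prob` is assumed to be the statistical model's probability that the document is positive.

```python
def composite_score(model_prob, rule_fires, weight):
    """Blend a rule's vote with the model's probability.

    Assumed formula: a firing rule contributes a vote of 1.0 with the
    given weight, and the model contributes the remainder. A weight of
    1.0 therefore means the rule is always followed.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0.0 and 1.0")
    if not rule_fires:
        return model_prob
    return weight * 1.0 + (1.0 - weight) * model_prob


def classify(model_prob, rule_fires, weight, threshold=0.5):
    """Map the composite score to a label (threshold is an assumption)."""
    score = composite_score(model_prob, rule_fires, weight)
    return "positive" if score >= threshold else "negative"
```

With this blend, a rule with weight 1.0 overrides a model probability of 0.2 entirely, while a weight of 0.5 raises the same probability to 0.6; adjusting the weight at block 1415 changes how strongly the rule pulls the final classification.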

As another example of an iterative loop, it may be determined that new topics should be discovered or that topics may need to be refined or modified in some way. Thus, from block 1425, the natural language platform may be configured to proceed back to block 1410 to conduct topic modeling again, which in some cases is a sub-process within the overall process for creating the ontology. To modify any topics, the natural language platform may be configured to accept various parameters for discovering new topics. For example, new documents may be accessed, or certain documents may be given more weight. In other cases, the natural language platform may receive inputs to merge certain topics together into a single composite task or label within the ontology. In other cases, the natural language platform may be configured to receive inputs to divide one or more topics into multiple topics. Analysts may be offered displays of various metrics that express levels of agreement or levels of performance of the natural language model when the topics are organized in one configuration versus another configuration. The analysts may then be able to compare how the model may improve if various topics are modified. Example displays for these various interfaces for modifying the topics, the rules, or the ontology may be described further in application (Attorney Docket No. 1402805.00013_IDB013), which again is incorporated herein by reference.

In some cases, after modifying any of the topics at block 1410, the process may proceed directly to block 1420 to modify the ontology in light of the newly modified topics. In other cases, it may also be desirable to modify any rules, so the modification iteration may move from block 1410 to block 1415 before proceeding to block 1420.

In some embodiments, these iterative processes may cease once performance of the natural language model is determined to satisfy some threshold criterion. At that point, the natural language model may be available for use to analyze untested data at block 1430.

Referring to FIG. 15, flowchart 1500 provides an additional variant for generating natural language models, this time including a maintenance process for updating currently operating natural language models, according to some embodiments. As shown, the various functional blocks in flowchart 1400 remain the same, but in addition, after the natural language model is in use to analyze untested data at block 1430, an additional block 1510 is provided to modify the currently running natural language model with updated data. The model may need to be updated periodically in order to adjust to changing language that may occur in certain collections of documents. For example, if the model is analyzing communications on Twitter, new idioms and slang may inevitably be communicated in tweets over time. A model trained six months ago may therefore lack proper training for how to handle any new changes in language.

Rather than generate a completely new model, the existing natural language model may simply be retrained with updated data. The retraining may include revising or discovering new topics, modifying or creating new rules, or modifying the ontology structure to account for the updated data. Thus, the natural language platform may be configured to retrain the model with updated data by proceeding to any of blocks 1410, 1415, or 1420, depending on the need. In some embodiments, modifications may include simply conducting one or more rounds of adaptive machine learning at block 1425, first by performing intelligent queuing to generate new annotations for the updated data.

In general, generating natural language processing models using the descriptions provided herein allows for: production of the models in a substantially shorter time period than conventional methods (e.g., highly accurate and fast performing models generated in a period of weeks rather than months), highly adaptive models that can evolve with changing input data, maintaining of consistent levels of model performance over time, the ability to leverage existing models and training data to develop domain-specific models with fewer new annotations, and the ability to add new labels and categorizations as requirements change.

Referring to FIG. 16, the block diagram illustrates components of a machine 1600, according to some example embodiments, able to read instructions 1624 from a machine-readable medium 1622 (e.g., a non-transitory machine-readable medium, a machine-readable storage medium, a computer-readable storage medium, or any suitable combination thereof) and perform any one or more of the methodologies discussed herein, in whole or in part. Specifically, FIG. 16 shows the machine 1600 in the example form of a computer system (e.g., a computer) within which the instructions 1624 (e.g., software, a program, an application, an applet, an app, or other executable code) for causing the machine 1600 to perform any one or more of the methodologies discussed herein may be executed, in whole or in part.

In alternative embodiments, the machine 1600 operates as a standalone device or may be connected (e.g., networked) to other machines. In a networked deployment, the machine 1600 may operate in the capacity of a server machine 110 or a client machine in a server-client network environment, or as a peer machine in a distributed (e.g., peer-to-peer) network environment. The machine 1600 may include hardware, software, or combinations thereof, and may, as an example, be a server computer, a client computer, a personal computer (PC), a tablet computer, a laptop computer, a netbook, a cellular telephone, a smartphone, a set-top box (STB), a personal digital assistant (PDA), a web appliance, a network router, a network switch, a network bridge, or any machine capable of executing the instructions 1624, sequentially or otherwise, that specify actions to be taken by that machine. Further, while only a single machine 1600 is illustrated, the term “machine” shall also be taken to include any collection of machines that individually or jointly execute the instructions 1624 to perform all or part of any one or more of the methodologies discussed herein.

The machine 1600 includes a processor 1602 (e.g., a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application specific integrated circuit (ASIC), a radio-frequency integrated circuit (RFIC), or any suitable combination thereof), a main memory 1604, and a static memory 1606, which are configured to communicate with each other via a bus 1608. The processor 1602 may contain microcircuits that are configurable, temporarily or permanently, by some or all of the instructions 1624 such that the processor 1602 is configurable to perform any one or more of the methodologies described herein, in whole or in part. For example, a set of one or more microcircuits of the processor 1602 may be configurable to execute one or more modules (e.g., software modules) described herein.

The machine 1600 may further include a video display 1610 (e.g., a plasma display panel (PDP), a light emitting diode (LED) display, a liquid crystal display (LCD), a projector, a cathode ray tube (CRT), or any other display capable of displaying graphics or video). The machine 1600 may also include an alphanumeric input device 1612 (e.g., a keyboard or keypad), a cursor control device 1614 (e.g., a mouse, a touchpad, a trackball, a joystick, a motion sensor, an eye tracking device, or other pointing instrument), a storage unit 1616, a signal generation device 1618 (e.g., a sound card, an amplifier, a speaker, a headphone jack, or any suitable combination thereof), and a network interface device 1620.

The storage unit 1616 includes the machine-readable medium 1622 (e.g., a tangible and non-transitory machine-readable storage medium) on which are stored the instructions 1624 embodying any one or more of the methodologies or functions described herein, including, for example, any of the descriptions of FIGS. 1A-15. The instructions 1624 may also reside, completely or at least partially, within the main memory 1604, within the processor 1602 (e.g., within the processor's cache memory), or both, before or during execution thereof by the machine 1600. The instructions 1624 may also reside in the static memory 1606.

Accordingly, the main memory 1604 and the processor 1602 may be considered machine-readable media 1622 (e.g., tangible and non-transitory machine-readable media). The instructions 1624 may be transmitted or received over a network 1626 via the network interface device 1620. For example, the network interface device 1620 may communicate the instructions 1624 using any one or more transfer protocols (e.g., HTTP). The machine 1600 may also represent example means for performing any of the functions described herein, including the processes described in FIGS. 1A-15.

In some example embodiments, the machine 1600 may be a portable computing device, such as a smart phone or tablet computer, and have one or more additional input components (e.g., sensors or gauges) (not shown). Examples of such input components include an image input component (e.g., one or more cameras), an audio input component (e.g., a microphone), a direction input component (e.g., a compass), a location input component (e.g., a GPS receiver), an orientation component (e.g., a gyroscope), a motion detection component (e.g., one or more accelerometers), an altitude detection component (e.g., an altimeter), and a gas detection component (e.g., a gas sensor). Inputs harvested by any one or more of these input components may be accessible and available for use by any of the modules described herein.

As used herein, the term “memory” refers to a machine-readable medium 1622 able to store data temporarily or permanently and may be taken to include, but not be limited to, random-access memory (RAM), read-only memory (ROM), buffer memory, flash memory, and cache memory. While the machine-readable medium 1622 is shown in an example embodiment to be a single medium, the term “machine-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database 115, or associated caches and servers) able to store instructions 1624. The term “machine-readable medium” shall also be taken to include any medium, or combination of multiple media, that is capable of storing the instructions 1624 for execution by the machine 1600, such that the instructions 1624, when executed by one or more processors of the machine 1600 (e.g., processor 1602), cause the machine 1600 to perform any one or more of the methodologies described herein, in whole or in part. Accordingly, a “machine-readable medium” refers to a single storage apparatus or device 120 or 130, as well as cloud-based storage systems or storage networks that include multiple storage apparatus or devices 120 or 130. The term “machine-readable medium” shall accordingly be taken to include, but not be limited to, one or more tangible (e.g., non-transitory) data repositories in the form of a solid-state memory, an optical medium, a magnetic medium, or any suitable combination thereof.

Furthermore, the machine-readable medium 1622 is non-transitory in that it does not embody a propagating signal. However, labeling the tangible machine-readable medium 1622 as “non-transitory” should not be construed to mean that the medium is incapable of movement; the medium should be considered as being transportable from one physical location to another. Additionally, since the machine-readable medium 1622 is tangible, the medium may be considered to be a machine-readable device.

Throughout this specification, plural instances may implement components, operations, or structures described as a single instance. Although individual operations of one or more methods are illustrated and described as separate operations, one or more of the individual operations may be performed concurrently, and nothing requires that the operations be performed in the order illustrated. Structures and functionality presented as separate components in example configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements fall within the scope of the subject matter herein.

Certain embodiments are described herein as including logic or a number of components, modules, or mechanisms. Modules may constitute software modules (e.g., code stored or otherwise embodied on a machine-readable medium 1622 or in a transmission medium), hardware modules, or any suitable combination thereof. A “hardware module” is a tangible (e.g., non-transitory) unit capable of performing certain operations and may be configured or arranged in a certain physical manner. In various example embodiments, one or more computer systems (e.g., a standalone computer system, a client computer system, or a server computer system) or one or more hardware modules of a computer system (e.g., a processor 1602 or a group of processors 1602) may be configured by software (e.g., an application or application portion) as a hardware module that operates to perform certain operations as described herein.

In some embodiments, a hardware module may be implemented mechanically, electronically, or any suitable combination thereof. For example, a hardware module may include dedicated circuitry or logic that is permanently configured to perform certain operations. For example, a hardware module may be a special-purpose processor, such as a field programmable gate array (FPGA) or an ASIC. A hardware module may also include programmable logic or circuitry that is temporarily configured by software to perform certain operations. For example, a hardware module may include software encompassed within a general-purpose processor 1602 or other programmable processor 1602. It will be appreciated that the decision to implement a hardware module mechanically, in dedicated and permanently configured circuitry, or in temporarily configured circuitry (e.g., configured by software) may be driven by cost and time considerations.

Hardware modules can provide information to, and receive information from, other hardware modules. Accordingly, the described hardware modules may be regarded as being communicatively coupled. Where multiple hardware modules exist contemporaneously, communications may be achieved through signal transmission (e.g., over appropriate circuits and buses 1608) between or among two or more of the hardware modules. In embodiments in which multiple hardware modules are configured or instantiated at different times, communications between such hardware modules may be achieved, for example, through the storage and retrieval of information in memory structures to which the multiple hardware modules have access. For example, one hardware module may perform an operation and store the output of that operation in a memory device to which it is communicatively coupled. A further hardware module may then, at a later time, access the memory device to retrieve and process the stored output. Hardware modules may also initiate communications with input or output devices, and can operate on a resource (e.g., a collection of information).

The various operations of example methods described herein may be performed, at least partially, by one or more processors 1602 that are temporarily configured (e.g., by software) or permanently configured to perform the relevant operations. Whether temporarily or permanently configured, such processors 1602 may constitute processor-implemented modules that operate to perform one or more operations or functions described herein. As used herein, “processor-implemented module” refers to a hardware module implemented using one or more processors 1602.

Similarly, the methods described herein may be at least partially processor-implemented, a processor 1602 being an example of hardware. For example, at least some of the operations of a method may be performed by one or more processors 1602 or processor-implemented modules. As used herein, “processor-implemented module” refers to a hardware module in which the hardware includes one or more processors 1602. Moreover, the one or more processors 1602 may also operate to support performance of the relevant operations in a “cloud computing” environment or as a “software as a service” (SaaS). For example, at least some of the operations may be performed by a group of computers (as examples of machines 1600 including processors 1602), with these operations being accessible via a network 1626 (e.g., the Internet) and via one or more appropriate interfaces (e.g., an API).

The performance of certain operations may be distributed among the one or more processors 1602, not only residing within a single machine 1600, but deployed across a number of machines 1600. In some example embodiments, the one or more processors 1602 or processor-implemented modules may be located in a single geographic location (e.g., within a home environment, an office environment, or a server farm). In other example embodiments, the one or more processors 1602 or processor-implemented modules may be distributed across a number of geographic locations.

Unless specifically stated otherwise, discussions herein using words such as “processing,” “computing,” “calculating,” “determining,” “presenting,” “displaying,” or the like may refer to actions or processes of a machine 1600 (e.g., a computer) that manipulates or transforms data represented as physical (e.g., electronic, magnetic, or optical) quantities within one or more memories (e.g., volatile memory, non-volatile memory, or any suitable combination thereof), registers, or other machine components that receive, store, transmit, or display information. Furthermore, unless specifically stated otherwise, the terms “a” or “an” are herein used, as is common in patent documents, to include one or more than one instance. Finally, as used herein, the conjunction “or” refers to a non-exclusive “or,” unless specifically stated otherwise.

The present disclosure is illustrative and not limiting. Further modifications will be apparent to one skilled in the art in light of this disclosure and are intended to fall within the scope of the appended claims.

What is claimed is:
 1. A system for generating a natural language model, the system comprising: at least one memory and at least one processor communicatively coupled to the at least one memory; and a database module, an application program interface (API) module, a background processing module, and an applications module, each stored on the at least one memory and executable by the at least one processor; the API module configured to ingest training data representative of documents to be analyzed by the natural language model and to store the training data in the database module; the background processing module configured to: generate a hierarchical data structure, the hierarchical data structure comprising at least two topical nodes, wherein the at least two topical nodes represent partitions organized by two or more topical themes among the topical content of the training data within which the training data is to be subdivided into; select among the training data a plurality of documents to be annotated; generate at least one annotation prompt for each document among the plurality of documents to be annotated, said annotation prompt configured to elicit an annotation about said document indicating which node among the at least two topical nodes of the hierarchical data structure said document is to be classified into; the application module configured to cause display of the at least one annotation prompt for each document among the plurality of documents to be annotated; the API module further configured to receive for each document among the plurality of documents to be annotated, the annotation in response to the displayed annotation prompt; and the background processing module further configured to generate the natural language model using an adaptive machine learning process configured to determine, among the received annotations, patterns for how the documents in the training data are to be subdivided according to the at least two topical nodes of the hierarchical data structure.
 2. The system of claim 1, wherein the background processing module is further configured to test performance of the natural language model using a subset of the documents among the training data that received annotations.
 3. The system of claim 2, wherein the background processing module is further configured to: compute a performance metric of the natural language model, based on results of the testing; and determine whether the natural language model satisfies at least one performance criterion based on the computed performance metric.
 4. The system of claim 3, wherein the background processing module is further configured to: perform one or more optimization techniques configured to improve performance of the natural language platform, in response to determining that the natural language platform fails to satisfy the at least one performance criterion based on the computed performance metric.
 5. The system of claim 4, wherein the one or more optimization techniques comprises at least one of: a feature selection process, a padding and rebalancing process of the natural language model, a pruning process of the natural language model, a feature discovery process, a smoothing process of the natural language model, or a model interpolation process.
 6. The system of claim 3, wherein the background processing module is further configured to: determine that the natural language platform fails to satisfy the at least one performance criterion based on the computed performance metric; in response to said determining: identify a topical node among the two or more topical nodes of the hierarchical data structure that the natural language model fails to accurately categorize documents into; select a second plurality of documents to be annotated, the second plurality comprising documents associated with said topical node that the natural language model failed to accurately categorize documents into; and generate a second set of at least one annotation prompt for each document among the second plurality of documents to be annotated, said annotation prompt among the second set configured to elicit an annotation about said document to improve the natural language model in accurately categorizing documents into said topical node; wherein the applications module is further configured to cause display of the second set of the at least one annotation prompt for each document among the second plurality of documents to be annotated; wherein the API module is further configured to receive for each document among the second plurality of documents to be annotated, a second set of annotations in response to the second set of displayed annotation prompts; and wherein the background processing module is further configured to generate a refined natural language model using the adaptive machine learning process and based on the hierarchical data structure, the training data and the second set of annotations.
 7. The system of claim 1, wherein generating the hierarchical data structure comprises: performing a topic modeling process configured to identify two or more topics among the content of the training data that is configured to define the two or more topical nodes of the hierarchical data structure.
 8. The system of claim 1, wherein the background processing module is further configured to access one or more rules configured to instruct the natural language model how to categorize one or more documents into the two or more nodes of the hierarchical data structure.
 9. The system of claim 8, wherein generating the hierarchical data structure comprises: conducting a rules generation process configured to evaluate logical consistency among the one or more rules.
 10. The system of claim 1, wherein generating the hierarchical data structure comprises: generating at least one annotation prompt for each topical node among the two or more topical nodes in the hierarchical data structure, said annotation prompt configured to elicit an annotation about said topical node indicating a level of accuracy of placement of the node within the hierarchical data structure; causing display of the at least one annotation prompt for each topical node; and receiving for each topical node, the annotation in response to the displayed annotation prompt.
 11. The system of claim 10, wherein the background processing module is further configured to evaluate performance of the hierarchical data structure based on the annotations.
 12. The system of claim 11, wherein the background processing module is further configured to determine that the hierarchical data structure fails to satisfy at least one performance criterion in response to the evaluating; and modify a logical relationship among the two or more topical nodes based on the annotations and in response to determining that the data structure fails to satisfy the at least one performance criterion.
 13. The system of claim 9, wherein the API module is further configured to receive a training guideline based on the annotations to the nodes, the training guideline configured to provide instructions to an annotator for answering one or more annotation prompts for each document among the plurality of documents to be annotated.
 14. The system of claim 1, wherein the hierarchical data structure comprises at least a third topical node and a fourth topical node, wherein the third and fourth topical nodes both represent sub-partitions within the topical theme of the first node and organized by a third and fourth topical theme, respectively, among the topical content of the training data within which the training data is to be subdivided into.
 15. A method for generating a natural language platform system configured to generate a natural language model, the method comprising: deriving from an analogous natural language platform system, parameters configured to optimize performance of said analogous system to generate an analogous natural language model configured to analyze similar but not identical documents as the natural language model; interpolating said parameters to be optimized for documents to be analyzed by the natural language model; and implementing the interpolated parameters in the natural language platform system such that the interpolated parameters are configured to generate the natural language model.
 16. A system for generating natural language models, the system comprising: a full service natural language platform geographically located at a remote host location and configured to: receive training data through a network connection; train a natural language model based on the received training data; and generate predictions about untested data using the natural language model; and a connector module geographically located at a local client host location and communicatively coupled to the full service natural language platform through the network connection and configured to: access the training data stored in a client data store at the local client host location; format the training data in a uniform manner; transmit the training data to the full service natural language platform through the network connection; receive the predictions about the untested data; and store the predictions about the untested data in a memory at the local client host location.
 17. The system of claim 16, further comprising a text extraction module communicatively coupled to the connector module and configured to attach to the client data store and extract textual data for use as the training data.
 18. The system of claim 16, wherein the full service natural language platform is further configured to: generate a hierarchical data structure, the hierarchical data structure comprising at least two topical nodes, wherein the at least two topical nodes represent partitions organized by two or more topical themes among the topical content of the training data within which the training data is to be subdivided into; select among the training data a plurality of documents to be annotated; and generate at least one annotation prompt for each document among the plurality of documents to be annotated, said annotation prompt configured to elicit an annotation about said document indicating which node among the at least two topical nodes of the hierarchical data structure said document is to be classified into.
 19. The system of claim 18, wherein the full service natural language platform is further configured to: cause display of the at least one annotation prompt for each document among the plurality of documents to be annotated; receive for each document among the plurality of documents to be annotated, the annotation in response to the displayed annotation prompt; and generate the natural language model using an adaptive machine learning process configured to determine, among the received annotations, patterns for how the documents in the training data are to be subdivided according to the at least two topical nodes of the hierarchical data structure.
 20. The system of claim 19, wherein the full service natural language platform is further configured to: test performance of the natural language model using a subset of the documents among the training data that received annotations; compute a performance metric of the natural language model, based on results of the testing; determine that the natural language platform fails to satisfy the at least one performance criterion based on the computed performance metric; in response to said determining: identify a topical node among the two or more topical nodes of the hierarchical data structure that the natural language model fails to accurately categorize documents into; select a second plurality of documents to be annotated, the second plurality comprising documents associated with said topical node that the natural language model failed to accurately categorize documents into; generate a second set of at least one annotation prompt for each document among the second plurality of documents to be annotated, said annotation prompt among the second set configured to elicit an annotation about said document to improve the natural language model in accurately categorizing documents into said topical node; cause display of the second set of the at least one annotation prompt for each document among the second plurality of documents to be annotated; receive for each document among the second plurality of documents to be annotated, a second set of annotations in response to the second set of displayed annotation prompts; and generate a refined natural language model using the adaptive machine learning process and based on the hierarchical data structure, the training data and the second set of annotations.
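
By way of illustration only, the refinement steps recited in claims 6 and 20 (testing on annotated documents, computing a performance metric, identifying an underperforming topical node, selecting a second plurality of documents, and generating a second set of annotation prompts) might be sketched as follows. This is a minimal sketch and not the claimed implementation. Every name in it is hypothetical: the per-node recall metric, the 0.8 threshold, the batch size, the sample documents, and the wording of the prompt are all assumptions chosen for the example, and the claims do not specify any of them.

```python
from collections import defaultdict


def per_node_recall(gold, predicted):
    """Per topical node, the fraction of annotated documents the model placed correctly.

    Recall is an assumed stand-in for the unspecified "performance metric".
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for doc_id, node in gold.items():
        totals[node] += 1
        if predicted.get(doc_id) == node:
            hits[node] += 1
    return {node: hits[node] / totals[node] for node in totals}


def select_second_batch(gold, predicted, candidates_by_node,
                        threshold=0.8, batch_size=2):
    """Find nodes below the (assumed) criterion and pick documents tied to them."""
    recall = per_node_recall(gold, predicted)
    weak_nodes = sorted(node for node, r in recall.items() if r < threshold)
    batch = []
    for node in weak_nodes:
        # candidates_by_node is a hypothetical pool of not-yet-annotated
        # documents associated with each topical node.
        for doc_id in candidates_by_node.get(node, [])[:batch_size]:
            batch.append((doc_id, node))
    return weak_nodes, batch


# Hypothetical annotated test subset: document id -> annotated node.
gold = {"d1": "billing", "d2": "billing", "d3": "shipping", "d4": "shipping"}
# Hypothetical model output on the same documents.
predicted = {"d1": "billing", "d2": "billing", "d3": "billing", "d4": "shipping"}
candidates_by_node = {"shipping": ["d7", "d8", "d9"], "billing": ["d5"]}

weak_nodes, second_batch = select_second_batch(gold, predicted, candidates_by_node)

# Second set of annotation prompts, one per selected document.
prompts = [
    f"Should document {doc_id} be classified under '{node}'?"
    for doc_id, node in second_batch
]
```

In this sketch only the "shipping" node falls below the assumed threshold, so the second batch is drawn solely from documents associated with that node. The annotations received in response would then be added to the training data before the model is regenerated.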